This is a supplementary material to the paper [5].

It contains technical discussions and/or results adapted from published
papers: in Section 2 we show that geometrically ergodic Markov chains satisfy
the assumptions H4(b) and H5; in Sections 3 and 4 we provide results, useful
for the proofs of some theorems in [8], which are close to existing results in
the literature.

It also contains in Section 5 additional plots for the numerical analyses in
[8, Section 3].

To make this supplement as self-contained as possible, we restate in
Section 1 the model and the main definitions introduced in [8].

1 Assumptions and Model 

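Let $Y = \{Y_t\}_{t \in \mathbb{Z}}$ be the observation process defined on
$(\Omega, \mathbb{P}_\star, \mathcal{F})$ and taking values in
$\mathsf{Y}^{\mathbb{Z}}$, where $\mathsf{Y}$ is a general space endowed with a
countably generated $\sigma$-field $\mathcal{B}(\mathsf{Y})$.

A HMM parameterized by $\theta$, for $\theta$ in a set
$\Theta \subseteq \mathbb{R}^{d_\theta}$, is fitted to the observations:
consider a family of transition kernels
$\{m_\theta(x, x')\,d\lambda(x')\}_{\theta \in \Theta}$ onto
$\mathsf{X} \times \mathcal{B}(\mathsf{X})$, where $\mathsf{X}$ is a general
state-space equipped with a countably generated $\sigma$-field
$\mathcal{B}(\mathsf{X})$, and $\lambda$ is a bounded non-negative measure on
$(\mathsf{X}, \mathcal{B}(\mathsf{X}))$. Let
$\{g_\theta(x, y)\,d\nu(y)\}_{\theta \in \Theta}$ be a family of transition
kernels on $(\mathsf{X} \times \mathcal{B}(\mathsf{Y}))$, where $\nu$ is a
measure on $(\mathsf{Y}, \mathcal{B}(\mathsf{Y}))$.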

It is assumed: 

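H1 (a) There exist continuous functions $\phi : \Theta \to \mathbb{R}$,
$\psi : \Theta \to \mathbb{R}^d$ and
$S : \mathsf{X} \times \mathsf{X} \times \mathsf{Y} \to \mathbb{R}^d$ s.t.

\[ \log m_\theta(x, x') + \log g_\theta(x', y) = \phi(\theta) + \langle S(x, x', y), \psi(\theta) \rangle , \]

where $\langle \cdot, \cdot \rangle$ denotes the scalar product on $\mathbb{R}^d$.

(b) There exists an open subset $\mathcal{S}$ of $\mathbb{R}^d$ that contains
the convex hull of $S(\mathsf{X} \times \mathsf{X} \times \mathsf{Y})$.

(c) There exists a continuous function $\bar\theta : \mathcal{S} \to \Theta$
s.t. for any $s \in \mathcal{S}$,

\[ \bar\theta(s) = \mathrm{argmax}_{\theta \in \Theta} \{ \phi(\theta) + \langle s, \psi(\theta) \rangle \} . \]

H2 There exist $\sigma_-$ and $\sigma_+$ s.t. for any $(x, x') \in \mathsf{X}^2$
and any $\theta \in \Theta$, $0 < \sigma_- \le m_\theta(x, x') \le \sigma_+$.
Set $\rho = 1 - (\sigma_-/\sigma_+)$.

H3-($p$) $\mathbb{E}_\star \left[ \sup_{(x,x') \in \mathsf{X}^2} |S(x, x', Y_0)|^p \right] < +\infty$.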

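H4 (a) Under $\mathbb{P}_\star$, $Y$ is a stationary sequence.

(b) The shift operator is ergodic with respect to $\mathbb{P}_\star$.

(c) $\mathbb{E}_\star \left[ |\log b_-(Y_0)| + |\log b_+(Y_0)| \right] < +\infty$ where

\[ b_-(y) = \inf_{\theta \in \Theta} \int g_\theta(x, y)\,\lambda(dx) , \tag{1} \]
\[ b_+(y) = \sup_{\theta \in \Theta} \int g_\theta(x, y)\,\lambda(dx) . \tag{2} \]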

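For any sequence of r.v. $Z = \{Z_t\}_{t \in \mathbb{Z}}$ on
$(\Omega, \mathbb{P}_\star, \mathcal{F})$, let

\[ \mathcal{F}_k^Z = \sigma(\{Z_u\}_{u \le k}) \quad \text{and} \quad \mathcal{G}_k^Z = \sigma(\{Z_u\}_{u \ge k}) \]

be $\sigma$-fields associated to $Z$. We also define the mixing coefficients by, see [4],

\[ \beta^Z(n) = \sup_{u \in \mathbb{Z}} \sup_{B \in \mathcal{G}^Z_{u+n}} |\mathbb{P}(B \mid \mathcal{F}^Z_u) - \mathbb{P}(B)| , \quad \forall n \ge 0 . \tag{3} \]

H5 There exist $C \in [0, 1)$ and $\beta \in (0, 1)$ s.t. for any $n \ge 0$,
$\beta^Y(n) \le C \beta^n$, where $\beta^Y$ is defined in (3).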

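H6-($p$) The block size sequence $\{\tau_n\}_{n \ge 1}$ satisfies
$\sum_{n \ge 1} \tau_n^{-p/2} < \infty$.

Define for any $\theta \in \Theta$,

\[ \bar S(\theta) = \mathbb{E}_\star \left[ \mathbb{E}_\theta [ S(X_{-1}, X_0, Y_0) \mid Y ] \right] , \tag{4} \]

\[ R(\theta) = \bar\theta(\bar S(\theta)) , \tag{5} \]

\[ G(s) = \bar S(\bar\theta(s)) , \quad \forall s \in \mathcal{S} , \tag{6} \]

where $\bar\theta$ is given by H1(c).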

H7 (a) $\bar S$ and $\bar\theta$ are twice continuously differentiable on
$\Theta$ and $\mathcal{S}$.
(b) There exists $0 < \gamma < 1$ s.t.
$\mathrm{sp}\left( \nabla_s (\bar S \circ \bar\theta)_{s = \bar S(\theta_\star)} \right) \le \gamma$,
where sp denotes the spectral norm.



Note that under H7, $\mathrm{sp}(\Gamma) \le \gamma$, where
$\Gamma = \nabla G(s_\star)$ and $s_\star = \bar S(\theta_\star)$. Set

\[ T_n = \sum_{i=1}^{n} \tau_i , \qquad T_0 = 0 . \]



H8 (a) {T„ + i/T n } ra >o converges to q and 75 < 1. 



(b) hmsup„£Li{ 



ffk + \0gT k }/y/%, < 00. 



2 Checking H4(b) and H5

2.1 Checking H4(b)

The following discussion has been suggested by R. Douc and E. Moulines.
The authors would like to thank them for the fruitful discussions.

Assumption H4(b) can be easily proved when $Y$ is a positive recurrent and
$\psi$-irreducible Markov chain. Assume first that $Y$ is a one-sided Markov
chain $\{Y_t\}_{t \ge 0}$ with invariant probability $\pi$ defined on a
probability space $(\Omega, \mathcal{F}, \mathbb{P}_\pi)$.

We may choose $\Omega = \mathsf{Y}^{\mathbb{N}}$ and
$\mathcal{F} = \mathcal{B}(\mathsf{Y})^{\otimes \mathbb{N}}$, $Y$ being the
canonical process. To prove H4(b), we show that $Y$ is mixing
(see [U Chapter 13]) i.e.



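\[ \lim_{k \to +\infty} \mathbb{P}_\pi \{ Y \in A, \vartheta^k(Y) \in B \} = \mathbb{P}_\pi \{ Y \in A \}\, \mathbb{P}_\pi \{ Y \in B \} , \quad \forall A, B \in \mathcal{B}(\mathsf{Y})^{\otimes \mathbb{N}} . \tag{7} \]

It is sufficient to prove (7) when $A$ is a cylinder, i.e. when there exist a
positive integer $p$, a set $H \in \mathcal{B}(\mathsf{Y})^{\otimes p}$ and a
$p$-tuple $(i_1, \dots, i_p)$ of distinct non-negative integers such that
$A = \{ \omega = \{\omega_i\}_{i \ge 0} \in \mathsf{Y}^{\mathbb{N}} ; (\omega_{i_1}, \dots, \omega_{i_p}) \in H \}$.
For all sufficiently large $k$, by the Markov property,

\[ \mathbb{P}_\pi \{ Y \in A, \vartheta^k(Y) \in B \} = \mathbb{E}_\pi \left[ 1_A(Y) 1_B(\vartheta^k(Y)) \right] = \mathbb{E}_\pi \left[ 1_A(Y)\, \mathbb{E}_{Y_{i_p}} \left[ \mathbb{E}_{Y_{k - i_p}} [ 1_B(Y) ] \right] \right] . \]

Under the stated assumption on $Y$, we can choose a set $\mathsf{Y}_0$ such
that $\pi(\mathsf{Y}_0) = 1$ and for any $x \in \mathsf{Y}_0$,
$\mathbb{E}_x \left[ \mathbb{E}_{Y_{k - i_p}} [ 1_B(Y) ] \right] \to \mathbb{P}_\pi\{B\}$
as $k \to +\infty$. The proof is then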

concluded by the dominated convergence theorem. This result implies that
if $Y$ is a two-sided Markov chain $\{Y_t\}_{t \in \mathbb{Z}}$, with a positive
recurrent and $\psi$-irreducible transition kernel, then $Y$ is ergodic. Indeed,
$\mathbb{P}_\pi$ may be extended on
$(\mathsf{Y}^{\mathbb{Z}}, \mathcal{B}(\mathsf{Y})^{\otimes \mathbb{Z}})$ where
$\mathcal{B}(\mathsf{Y})^{\otimes \mathbb{Z}}$ is generated by
$\bigcup_{i < j \in \mathbb{Z}} \mathcal{F}_{i,j}$, with, for any integers
$i < j$, $\mathcal{F}_{i,j} = \sigma(\{Y_k ; i \le k \le j\})$. For any integers
$i, j$ and $k$ s.t. $i < j$ and any $A, B \in \mathcal{F}_{i,j}$,

\[ \mathbb{P}_\pi \{ A \cap \vartheta^{-k}(B) \} = \mathbb{P}_\pi \{ \vartheta^{-i}(A) \cap \vartheta^{-k}(\vartheta^{-i}(B)) \} , \]

where $\vartheta^{-i}(A), \vartheta^{-i}(B) \in \sigma(\{Y_k ; k \ge 0\})$. Then, we can conclude as above.

2.2 Checking H5

1) Combining [11, Chapter 9] and recent results on the control of the ergodicity
of Markov chains by coupling technique, see [6], it can be proved that H5
holds for $\psi$-irreducible, aperiodic and geometrically ergodic Markov chains.

2) Upon noting that, for all $n \ge 0$, $\beta^Y(n) \le \beta^{(X,Y)}(n)$, we can
similarly prove that H4(b) and H5 hold when $Y$ is the observation process of a
joint $\psi$-irreducible and aperiodic Markov chain $(X, Y)$ (e.g. $(X, Y)$ is a
hidden Markov model). In this case, irreducibility, aperiodicity and geometric
ergodicity have to be established for the Markov transition kernel $Q$ of the
$\mathsf{X} \times \mathsf{Y}$ valued Markov chain $\{(X_t, Y_t)\}_{t \in \mathbb{Z}}$.

3 Detailed proofs of [8] 

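Recall the following definition from [8]: for a distribution $\chi$ on
$(\mathsf{X}, \mathcal{B}(\mathsf{X}))$, positive integers $T$, $\tau$ and
$\theta \in \Theta$, set

\[ \bar S^{\chi,T}_{\tau}(\theta, Y) = \frac{1}{\tau} \sum_{t=T+1}^{T+\tau} \Phi^{\chi,T}_{\theta,t,T+\tau}(S, Y) , \tag{8} \]

where $S$ is the function given by H1(a) and

\[ \Phi^{\chi,r}_{\theta,s,t}(S, y) = \frac{\int \chi(dx_r) \{ \prod_{i=r}^{t-1} m_\theta(x_i, x_{i+1}) g_\theta(x_{i+1}, y_{i+1}) \}\, S(x_{s-1}, x_s, y_s)\, d\lambda(x_{r+1:t})}{\int \chi(dx_r) \{ \prod_{i=r}^{t-1} m_\theta(x_i, x_{i+1}) g_\theta(x_{i+1}, y_{i+1}) \}\, d\lambda(x_{r+1:t})} . \tag{9} \]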

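3.1 Proof of [8, Theorem 4.4]

We check the assumptions of [7, Proposition 9] and [7, Proposition 11] with
$T(\theta) = R(\theta)$ (see (5)),
$F_n(\theta) = \bar\theta(\bar S^{\chi_n,T_n}_{\tau_{n+1}}(\theta_n, Y))$ and
$\mathcal{L} = \{\theta \in \Theta ; R(\theta) = \theta\}$.
We start by checking the conditions of [7, Proposition 11]. Under the stated
assumptions, (a) holds. For (c) we prove that for any compact subset
$\mathcal{K} \subseteq \Theta$,

\[ \left| W \circ R(\theta_n) - W \circ \bar\theta(\bar S^{\chi_n,T_n}_{\tau_{n+1}}(\theta_n, Y)) \right| 1_{\theta_n \in \mathcal{K}} \to 0 \quad \mathbb{P}_\star\text{-a.s.} \quad (n \to +\infty) . \tag{10} \]

By [8, Theorem 4.1], the function $\bar S$ given by (4) is continuous on
$\Theta$ and then
$\bar S(\mathcal{K}) = \{ s \in \mathcal{S} ; \exists \theta \in \mathcal{K}, s = \bar S(\theta) \}$
is compact and, for any $\delta > 0$ (small enough), we can define the compact
subset $\bar S(\mathcal{K}, \delta) = \{ s \in \mathbb{R}^d ; d(s, \bar S(\mathcal{K})) \le \delta \}$
of $\mathcal{S}$, where $d(s, \bar S(\mathcal{K})) = \inf_{s' \in \bar S(\mathcal{K})} |s - s'|$.
Let $\delta > 0$ (small enough) and $\varepsilon > 0$. Since
$W \circ \bar\theta$ is continuous (see H1(c) and [8, Proposition 4.2]) and
$\bar S(\mathcal{K}, \delta)$ is compact, $W \circ \bar\theta$ is uniformly
continuous on $\bar S(\mathcal{K}, \delta)$ and there exists $\eta > 0$ s.t.,

\[ \forall x, y \in \bar S(\mathcal{K}, \delta) , \quad |x - y| \le \eta \Rightarrow |W \circ \bar\theta(x) - W \circ \bar\theta(y)| \le \varepsilon . \tag{11} \]

Set $\alpha = \delta \wedge \eta$ and
$\Delta S_n = |\bar S(\theta_n) - \bar S^{\chi_n,T_n}_{\tau_{n+1}}(\theta_n, Y)| 1_{\theta_n \in \mathcal{K}}$.
We write,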



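\[
\begin{aligned}
& \mathbb{P}_\star \left\{ \left| W \circ \bar\theta(\bar S(\theta_n)) - W \circ \bar\theta(\bar S^{\chi_n,T_n}_{\tau_{n+1}}(\theta_n, Y)) \right| 1_{\theta_n \in \mathcal{K}} > \varepsilon \right\} \\
& \quad = \mathbb{P}_\star \left\{ \left| W \circ \bar\theta(\bar S(\theta_n)) - W \circ \bar\theta(\bar S^{\chi_n,T_n}_{\tau_{n+1}}(\theta_n, Y)) \right| 1_{\theta_n \in \mathcal{K}} > \varepsilon ; \Delta S_n > \delta \right\} \\
& \qquad + \mathbb{P}_\star \left\{ \left| W \circ \bar\theta(\bar S(\theta_n)) - W \circ \bar\theta(\bar S^{\chi_n,T_n}_{\tau_{n+1}}(\theta_n, Y)) \right| 1_{\theta_n \in \mathcal{K}} > \varepsilon ; \Delta S_n \le \delta \right\} \\
& \quad \le \mathbb{P}_\star \{ \Delta S_n > \delta \} + \mathbb{P}_\star \{ \Delta S_n > \eta \} \le 2\, \mathbb{P}_\star \{ \Delta S_n > \alpha \} .
\end{aligned}
\]

By the Markov inequality and [8, Proposition 6.5], since $2 < p_1 < p_2$, there
exists a constant $C$ s.t.

\[ \mathbb{P}_\star \left\{ \left| W \circ \bar\theta(\bar S(\theta_n)) - W \circ \bar\theta(\bar S^{\chi_n,T_n}_{\tau_{n+1}}(\theta_n, Y)) \right| 1_{\theta_n \in \mathcal{K}} > \varepsilon \right\} \le \frac{2}{\alpha^{p_1}}\, \mathbb{E}_\star \left[ \left| \bar S(\theta_n) - \bar S^{\chi_n,T_n}_{\tau_{n+1}}(\theta_n, Y) \right|^{p_1} 1_{\theta_n \in \mathcal{K}} \right] \le C \left( \frac{1}{\tau_{n+1}} \right)^{p_1/2} . \]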

(10) follows from H6-($p_1$) and the Borel-Cantelli lemma. The proof of the
condition (b) follows the same lines. By [7, Proposition 11], this implies that
$\limsup_n p_n < +\infty$ $\mathbb{P}_\star$-a.s. and that
$\{\theta_n\}_{n \ge 0}$ is a compact sequence $\mathbb{P}_\star$-a.s.
For the other statements, we apply [7, Proposition 9].
$\mathcal{L} \cap \mathcal{K}$ is compact since $\mathcal{L}$ is closed and
$\mathcal{K}$ is compact. We now prove that for any compact subset
$\mathcal{K} \subseteq \Theta$,

\[ \left| W(\theta_{n+1}) - W \circ R(\theta_n) \right| 1_{\theta_n \in \mathcal{K}} \to 0 \quad \mathbb{P}_\star\text{-a.s.} \quad (n \to +\infty) . \tag{12} \]

Since $\limsup_n p_n < +\infty$ $\mathbb{P}_\star$-a.s., it is sufficient to
prove this convergence on the set
$\{\omega \in \Omega ; \limsup_n p_n(\omega) < +\infty\}$. For any $\omega$ s.t.
$\limsup_n p_n(\omega) < +\infty$, there exists (a random) $n_0$ s.t., for any
$n \ge n_0$, $p_n(\omega) = p_{n+1}(\omega)$ and then
$\theta_{n+1}(\omega) = \theta_{n+1/2}(\omega)$, see [8, Eq. (6)]. Therefore,
(12) follows from (10).

3.2 Proof of [8, Proposition 6.6] 

Proposition 3.1 shows that we can address equivalently the convergence of the
statistics $\{\bar S^{\chi_n,T_n}_{\tau_{n+1}}(\theta_n, Y)\}_{n \ge 0}$ to some
fixed point of $G$ and the convergence of the sequence
$\{\theta_n\}_{n \ge 0}$ to some fixed point of $R$.

Proposition 3.1. Assume lffl(p 2 ), /Q[a|&]) ; iM and 1^ (pi ) for some 

2 < pi < P2- 



(i) Let 6* e C. Set s* = S(0 + ) = G(s*). Tften P* - a.s, 



lim |S*' T »- 1 ( 

n— ^+00 n 



dcf 



fii; Lei s* e S s.t. G(s*) = s*. Sei (9* = 0(s*) = R(0*). Then P* - a.s 
lim |0 n -0*|l ,x,T„_ lrfl v ^ =0. 



Proof. Let S be given by Q. By [8, Proposition 6.5] and H6j-(px)j 

lim(^»(0 n ,Y)-S(0 n )) =0 P* 
By [8, Theorem 4.1], S is continuous. Hence, 



a.s . 



lim 



S*T»(0 n> Y)-s(0*) 







a.s 



and the proof of follows. Since 6 is continuous, (JTTj) follows. □ 

We start with rewriting some definitions and assumptions introduced in [5]. 
Define the sequences S n , \i n Pm n > by 



So = S*;°(8 , Y) and S n d = S*£l (d n , Y) , Vn > , 
where S*' T is given by fio — 0, po — So — s+ and 

dcf 



dcf _ 

Pn — L Pn—1 T Cn j 



5n Mr- 



n > 1 



where, 



dcf 



e„ = £V> - S(0„) , n > 1 



(13) 

(14) 
(15) 



and S is given by (U| 



(proof) Let p G (2,p 2 ). By for all n > 1, //„ = ELo r fe e„_ fc . By 



and the Minkowski inequality, for all n > 1, ||Mnll* p ^ Sfc=o 7^ ll e n— fcll* p - By 



(13) and [8, Proposition 6.5], there exists a constant C s.t. for any n > 1, 



i^il, p <^EA ^ 



fc=0 



n+l-fc 



By [5J Result 178, p. 39] and H[8f|a| (upon noting that q > 1 so that ^7 < 1), 
this yields y/T^fi n = O hp (l). 
By H7, using a Taylor expansion with integral form of the remainder term,

d 

= ^2 (S n -l,i - s ±,i) ( S n-l,j ~ Rn-l(i,j) 



^ (p n -l,i + Pn-X,i)[^n-1,3 + Pn-l,j)Rn-l (h j) , 



where a;„ , denotes the i-th component of a;„ G M d and 



. \ dcf 



9 2 G 



Rn{i,j) = J (l-t)-Q-Q-(s* + t(S n -s*))dt, neN,l<i,j <d. 

Observe that under Fl[7] limsup„ |i? n |liim„ e„=e* < oo w.p.l. Define for n > 1 
and k < n, 



H n = y^ y ( 2 ^n,i + Pn,i)Rn(i, ') > 



del' 



<5„ — S(0„) — G(5 n _i) , 



<lcf 



dcf 



(16) 



= (T + fl n )..-(r + fl- fc ) (17) 



dcf - 



with the convention ip(n,n + 1) = Id. By (14 1, 

n— 1 n 

p„ = ^(n - 1, 0)p + X] _ X > fc + x ) r * + H ^( Tl - 1 ' ■ ( 18 ) 



fe=0 



fc=i 



Since Jr^Pn — Ol (1), H|6]-(pi) implies that /i„ — > P* — a.s. Then, by 
Q , Pnliim„s„= s , — ► OP* - a.s and by (Il6| lim |if n | l lim „ s„=s, = 

1 1 n— >+oo 1 1 n— >+oo 

P* — a.s. Let 7 G (7, <? _1 ), where 7 is given by r^7| and q by H[8j[aj) . Since 
lim \H n \ l\im n S n =8* — 0; there exists a P* — a.s finite random variable Z\ s.t., 

n— >+oo 

for all < fc < n - 1, 

|V(n - 1, fc)| l Umn Sn =.. < 7"" fe ^iliim„ s„= s , ■ (19) 

Therefore, \ijj{n- 1,0) p \ li im „ s„= s * < l n Zi \p \ P* - a.s, and, by Htf}(p 2 ), 
(|8} and ([13]) [|p | P2 ] < +00 which implies that p < +00 P* - a.s. Since 
97 < 1, the first term in the RHS of Q is t- 1 o Lp (l)O a . s (l). 

We now consider the second term in the RHS of (18 1. From equation (19 1, 

n-l 



^(n-U + l)r t 



fe=0 



n-l 



llim„ S„=s* < ^1 ^ 



f-^lrfcllun,,.^.. 



— a.s 



fc=0 



By ( 16 1 and 1^7j there exists a P* — a.s finite random variable Z2 s.t. 



K| liim„s„= s , < Z 2 X Pk,iP>k,j >P* - a.s . 



7 



In addition, since • v /rv^/i T! = there exists a constant C s.t. 



ra-l 

E 



< c 



\ - 7 



n—k— 1 



,p/2 



fc=0 



Applying again [§1 Result 178, p. 39] yields that the second term in the RHS 
of (18) is r- 1 C a . s (l)0 Lp/2 (l)- We finally consider the third term in the RHS of 
(fl8|. ByEfHlc]), on the set {uj € Q; lim S n (ui) — s*} we have lim 6(S n (u>)) = 



(9(s*). Hence, for any u; € the set {^(S'n(w)} Tl >o is compact and 6 n +i{uS) = 
n+ i/2(u) for all large n. By (|6) and Q, there exists a random integer rio( w ) 
s.t. for all n > n (uj), S n (uj) — 0. Then, there exists aP t - a.s-finite random 
variable Z3 s.t. for all n > 1, 



^ ^(n - 1, fc)^fellim„ s„ 



fc=l 



3 • 



Since under H[7] lim„ T n j n — 0, this implies that the third term in the RHS of 

P) is T^0 M {1). 



4 General results on HMM

In this section, we derive results on the forgetting properties of HMM
(Section 4.1), on their applications to bivariate smoothing distributions
(Section 4.2), on the asymptotic behavior of the normalized log-likelihood
(Section 4.3) and on the normalized score (Section 4.4).

We consider a HMM with kernels $m(x, x')\,d\lambda(x')$ onto
$\mathsf{X} \times \mathcal{B}(\mathsf{X})$ and $g(x, y)\,d\nu(y)$ on
$(\mathsf{X} \times \mathcal{B}(\mathsf{Y}))$. $\mathsf{X}$ is a general
state-space equipped with a countably generated $\sigma$-field
$\mathcal{B}(\mathsf{X})$, and $\lambda$ is a bounded non-negative measure on
$(\mathsf{X}, \mathcal{B}(\mathsf{X}))$; $\nu$ is a measure on
$(\mathsf{Y}, \mathcal{B}(\mathsf{Y}))$.

For any initial distribution $\chi$ on $(\mathsf{X}, \mathcal{B}(\mathsf{X}))$,
any $r < s \le t$ and any sequence $y \in \mathsf{Y}^{\mathbb{Z}}$, define the
probability measure $\Phi^{\chi,r}_{s,t}(\cdot, y)$ by

\[ \Phi^{\chi,r}_{s,t}(h, y) = \frac{\int \chi(dx_r) \{ \prod_{i=r}^{t-1} m(x_i, x_{i+1}) g(x_{i+1}, y_{i+1}) \}\, h(x_{s-1}, x_s, y_s)\, d\lambda(x_{r+1:t})}{\int \chi(dx_r) \{ \prod_{i=r}^{t-1} m(x_i, x_{i+1}) g(x_{i+1}, y_{i+1}) \}\, d\lambda(x_{r+1:t})} \tag{20} \]

for any bounded function $h$.

For any $s \in \mathbb{Z}$ and any $A \in \mathcal{B}(\mathsf{X})$, define

\[ L_s(x, A) = \int m(x, x') g(x', y_{s+1}) 1_A(x')\, \lambda(dx') , \tag{21} \]

and, for any $s \le t$, denote by $L_{s:t}$ the composition of the kernels
defined by

\[ L_{s:s} = L_s , \qquad L_{s:u+1}(x, A) = \int L_{s:u}(x, dx') L_{u+1}(x', A) . \]

By convention, $L_{s:s-1}$ is the identity kernel:
$L_{s:s-1}(x, A) = \delta_x(A)$. For any sequence
$y \in \mathsf{Y}^{\mathbb{Z}}$ and any function
$h : \mathsf{X}^2 \times \mathsf{Y} \to \mathbb{R}$, denote by $h_s$ the
function on $\mathsf{X}^2 \to \mathbb{R}$ given by

\[ h_s(x, x') = h(x, x', y_s) . \tag{22} \]

With these notations, equation (20) becomes

\[ \Phi^{\chi,r}_{s,t}(h, y) = \frac{\int \chi(dx_r) L_{r:s-2}(x_r, dx_{s-1})\, h_s(x_{s-1}, x_s)\, L_{s-1}(x_{s-1}, dx_s) L_{s:t-1}(x_s, \mathsf{X})}{\int \chi(dx_r) L_{r:t-1}(x_r, \mathsf{X})} . \tag{23} \]

4.1 Forward and Backward forgetting 

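For any $y \in \mathsf{Y}^{\mathbb{Z}}$, any probability distribution $\chi$ on
$(\mathsf{X}, \mathcal{B}(\mathsf{X}))$ and for any integers such that
$r \le s < t$, let us define two Markov kernels on
$(\mathsf{X}, \mathcal{B}(\mathsf{X}))$ by

\[ F_{s,t}(x, A) = \frac{\int L_s(x, dx_{s+1}) 1_A(x_{s+1}) L_{s+1:t-1}(x_{s+1}, \mathsf{X})}{L_{s:t-1}(x, \mathsf{X})} , \tag{24} \]

\[ B^{\chi,r}_{s}(x, A) = \frac{\int \phi^{\chi,r}_{s|r:s}(dx_s) 1_A(x_s) m(x_s, x)}{\int \phi^{\chi,r}_{s|r:s}(dx_s) m(x_s, x)} , \tag{25} \]

where

\[ \phi^{\chi,r}_{s|r:s}(A) = \frac{\int \chi(dx_r) L_{r:s-1}(x_r, dx_s) 1_A(x_s)}{\int \chi(dx_r) L_{r:s-1}(x_r, \mathsf{X})} . \]

Finally, the Dobrushin coefficient of a Markov kernel
$F : (\mathsf{X}, \mathcal{B}(\mathsf{X})) \to [0, 1]$ is defined by:

\[ \delta(F) = \frac{1}{2} \sup_{(x,x') \in \mathsf{X}^2} \| F(x, \cdot) - F(x', \cdot) \|_{\mathrm{TV}} . \]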

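Lemma 4.1. Assume that there exist positive numbers $\sigma_-, \sigma_+$ such
that $\sigma_- \le m(x, x') \le \sigma_+$ for any $x, x' \in \mathsf{X}$. Then
for any $y \in \mathsf{Y}^{\mathbb{Z}}$, $\delta(F_{s,t}) \le \rho$ and
$\delta(B^{\chi,r}_s) \le \rho$ where $\rho = 1 - \sigma_-/\sigma_+$.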

Proof. Let $r, s, t$ be such that $r \le s < t$. Under the stated assumptions,

\[ \int L_s(x_s, dx_{s+1}) 1_A(x_{s+1}) L_{s+1:t-1}(x_{s+1}, \mathsf{X}) \ge \sigma_- \int g(x_{s+1}, y_{s+1}) 1_A(x_{s+1}) L_{s+1:t-1}(x_{s+1}, \mathsf{X})\, \lambda(dx_{s+1}) \]

and

\[ L_{s:t-1}(x_s, \mathsf{X}) \le \sigma_+ \int g(x_{s+1}, y_{s+1}) L_{s+1:t-1}(x_{s+1}, \mathsf{X})\, \lambda(dx_{s+1}) . \]

This yields

\[ F_{s,t}(x_s, A) \ge \frac{\sigma_-}{\sigma_+}\, \frac{\int g(x_{s+1}, y_{s+1}) L_{s+1:t-1}(x_{s+1}, \mathsf{X}) 1_A(x_{s+1})\, \lambda(dx_{s+1})}{\int g(x_{s+1}, y_{s+1}) L_{s+1:t-1}(x_{s+1}, \mathsf{X})\, \lambda(dx_{s+1})} . \]

Similarly, the assumption implies

\[ B^{\chi,r}_{s}(x_{s+1}, A) \ge \frac{\sigma_-}{\sigma_+}\, \phi^{\chi,r}_{s|r:s}(A) , \]

which gives the upper bound for the Dobrushin coefficients, see [3, Lemma
4.3.13]. □
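
The bound of Lemma 4.1 can be checked numerically on a finite state space. The following is a minimal sketch, not part of the original argument: it assumes a finite set of `K` states with counting measure as $\lambda$, and the helper names `dobrushin` and `forward_smoothing_kernel` are hypothetical. The vector `weights` stands in for $x' \mapsto g(x', y_{s+1}) L_{s+1:t-1}(x', \mathsf{X})$, which may be any positive function.

```python
import numpy as np


def dobrushin(kernel):
    """Dobrushin coefficient: half the largest L1 distance between two rows."""
    diffs = np.abs(kernel[:, None, :] - kernel[None, :, :]).sum(axis=2)
    return 0.5 * diffs.max()


def forward_smoothing_kernel(m, weights):
    """Kernel with rows proportional to m(x, x') * weights(x'), as in (24)."""
    unnorm = m * weights[None, :]
    return unnorm / unnorm.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
K = 5  # number of hidden states (assumption: finite state space)

# Transition density bounded away from 0 and infinity, rows summing to one.
m = rng.uniform(0.5, 2.0, size=(K, K))
m /= m.sum(axis=1, keepdims=True)
sigma_minus, sigma_plus = m.min(), m.max()
rho = 1.0 - sigma_minus / sigma_plus

# Arbitrary positive weights playing the role of g(x', y) * L(x', X).
weights = rng.uniform(1e-3, 1.0, size=K)
F = forward_smoothing_kernel(m, weights)
delta = dobrushin(F)

print(f"delta(F) = {delta:.4f}, rho = {rho:.4f}")
```

In this sketch the minorization $F(x, \cdot) \ge (\sigma_-/\sigma_+)\,\eta(\cdot)$ holds with $\eta$ proportional to `weights`, so `delta` should never exceed `rho`, whatever the weights are.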

Lemma 4.2. Assume that there exist positive numbers $\sigma_-, \sigma_+$ such
that $\sigma_- \le m(x, x') \le \sigma_+$ for any $x, x' \in \mathsf{X}$. Let
$y \in \mathsf{Y}^{\mathbb{Z}}$.

(i) for any bounded function $h$, any probability distributions $\chi$ and
$\chi'$ and any integers $r \le s \le t$

\[ \left| \frac{\int \chi(dx_r) L_{r:s-1}(x_r, dx_s) h(x_s) L_{s:t-1}(x_s, \mathsf{X})}{\int \chi(dx_r) L_{r:t-1}(x_r, \mathsf{X})} - \frac{\int \chi'(dx_r) L_{r:s-1}(x_r, dx_s) h(x_s) L_{s:t-1}(x_s, \mathsf{X})}{\int \chi'(dx_r) L_{r:t-1}(x_r, \mathsf{X})} \right| \le \rho^{s-r}\, \mathrm{osc}(h) , \tag{26} \]

(ii) for any bounded function $h$, for any non-negative functions $f$ and $f'$
and any integers $r \le s \le t$

\[ \left| \frac{\int \chi(dx_s) h(x_s) L_{s:t-1}(x_s, dx_t) f(x_t)}{\int \chi(dx_s) L_{s:t-1}(x_s, dx_t) f(x_t)} - \frac{\int \chi(dx_s) h(x_s) L_{s:t-1}(x_s, dx_t) f'(x_t)}{\int \chi(dx_s) L_{s:t-1}(x_s, dx_t) f'(x_t)} \right| \le \rho^{t-s}\, \mathrm{osc}(h) . \tag{27} \]

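Proof of (i). See [3, Proposition 4.3.23].

Proof of (ii). When $s = t$, the quantity inside the absolute value in (27) is
equal to

\[ \frac{\int \chi(dx_t) h(x_t) f(x_t)}{\int \chi(dx_t) f(x_t)} - \frac{\int \chi(dx_t) h(x_t) f'(x_t)}{\int \chi(dx_t) f'(x_t)} . \]

This is of the form $(\eta - \eta') h$ where $\eta$ and $\eta'$ are probability
distributions on $(\mathsf{X}, \mathcal{B}(\mathsf{X}))$. Then,

\[ |(\eta - \eta') h| \le \frac{1}{2} \| \eta - \eta' \|_{\mathrm{TV}}\, \mathrm{osc}(h) \le \mathrm{osc}(h) . \]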



Let $s < t$. By definition of the backward smoothing kernel, see (25),

\[ B^{\chi,s}_{s}(x_{s+1}, A) = \frac{\int \chi(dx_s) 1_A(x_s) m(x_s, x_{s+1})}{\int \chi(dx_s) m(x_s, x_{s+1})} . \]

Therefore,

\[ \int \chi(dx_s) h(x_s) L_{s:t-1}(x_s, dx_t) f(x_t) = \int \chi(dx_s) L_s(x_s, dx_{s+1})\, B^{\chi,s}_{s} h(x_{s+1})\, L_{s+1:t-1}(x_{s+1}, dx_t) f(x_t) . \]

By repeated application of the backward smoothing kernel we have

\[ \int \chi(dx_s) h(x_s) L_{s:t-1}(x_s, dx_t) f(x_t) = \int \chi(dx_s) L_{s:t-1}(x_s, dx_t)\, B^{\chi,s}_{t-1:s} h(x_t)\, f(x_t) , \]

where we denote by $B^{\chi,s}_{u:s}$ the composition of the kernels defined by
induction for $s \le u$

\[ B^{\chi,s}_{s:s} = B^{\chi,s}_{s} , \qquad B^{\chi,s}_{u+1:s}(x, A) = \int B^{\chi,s}_{u+1}(x, dx')\, B^{\chi,s}_{u:s}(x', A) . \]



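Finally, by definition of $\phi^{\chi,s}_{t|s:t}$,

\[ \frac{\int \chi(dx_s) h(x_s) L_{s:t-1}(x_s, dx_t) f(x_t)}{\int \chi(dx_s) L_{s:t-1}(x_s, dx_t) f(x_t)} - \frac{\int \chi(dx_s) h(x_s) L_{s:t-1}(x_s, dx_t) f'(x_t)}{\int \chi(dx_s) L_{s:t-1}(x_s, dx_t) f'(x_t)} = \frac{\phi^{\chi,s}_{t|s:t} \left[ (B^{\chi,s}_{t-1:s} h) f \right]}{\phi^{\chi,s}_{t|s:t} [f]} - \frac{\phi^{\chi,s}_{t|s:t} \left[ (B^{\chi,s}_{t-1:s} h) f' \right]}{\phi^{\chi,s}_{t|s:t} [f']} . \]

This is of the form $(\eta - \eta') B^{\chi,s}_{t-1:s} h$ where $\eta$ and
$\eta'$ are probability distributions on $(\mathsf{X}, \mathcal{B}(\mathsf{X}))$.
The proof of the second statement is completed upon noting that

\[ \left| (\eta - \eta') B^{\chi,s}_{t-1:s} h \right| \le \frac{1}{2} \| \eta - \eta' \|_{\mathrm{TV}}\, \delta\!\left( B^{\chi,s}_{t-1:s} \right) \mathrm{osc}(h) \le \rho^{t-s}\, \mathrm{osc}(h) , \]

where we used Lemma 4.1 in the last inequality. □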
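
The forgetting bound (26) can be illustrated on a small discrete HMM. The sketch below is only an illustration under stated assumptions: the state space and the observation alphabet are finite, $\lambda$ is the counting measure, and the function `smoothed_mean` and all variable names are hypothetical. Each `L[u]` is the matrix version of the kernel $L_u$ of (21), and the ratio in (26) is computed with rescaled forward and backward vectors (the rescaling cancels in the ratio).

```python
import numpy as np


def smoothed_mean(chi, L, h, r, s, t):
    """Ratio appearing in (26): expectation of h(X_s) when X_r ~ chi,
    given observations y_{r+1:t}. L[u] is the matrix L_u(x, x')."""
    left = chi.copy()
    for u in range(r, s):              # chi L_{r:s-1}
        left = left @ L[u]
        left /= left.sum()             # rescaling cancels in the ratio
    right = np.ones(len(chi))
    for u in range(t - 1, s - 1, -1):  # L_{s:t-1}(x_s, X)
        right = L[u] @ right
        right /= right.sum()
    joint = left * right
    return float((joint * h).sum() / joint.sum())


rng = np.random.default_rng(1)
K, n_obs, T = 4, 3, 30

m = rng.uniform(0.5, 2.0, size=(K, K))
m /= m.sum(axis=1, keepdims=True)
g = rng.dirichlet(np.ones(n_obs), size=K)      # g[x, y], emission probabilities
y = rng.integers(0, n_obs, size=T + 1)         # y[u] plays the role of y_u
L = [m * g[:, y[u + 1]][None, :] for u in range(T)]

rho = 1.0 - m.min() / m.max()
h = rng.normal(size=K)
osc = h.max() - h.min()

chi = rng.dirichlet(np.ones(K))
chi_prime = rng.dirichlet(np.ones(K))

gaps = [abs(smoothed_mean(chi, L, h, 0, s, T)
            - smoothed_mean(chi_prime, L, h, 0, s, T)) for s in range(T + 1)]
bounds = [rho ** s * osc for s in range(T + 1)]
```

Plotting `gaps` against `bounds` on a log scale typically shows the observed gap decaying at least as fast as the geometric envelope $\rho^{s-r}\,\mathrm{osc}(h)$; the envelope is a worst-case rate and is usually far from tight.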
4.2 Bivariate smoothing distribution 

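In this section, the kernels $m$ and $g$ may depend on a parameter
$\theta \in \Theta$. The parameter $\theta$ is then introduced in the notation
for clarity.

Proposition 4.3. Assume H2. Let $\chi, \chi'$ be two distributions on
$(\mathsf{X}, \mathcal{B}(\mathsf{X}))$. For any measurable function
$h : \mathsf{X}^2 \times \mathsf{Y} \to \mathbb{R}^d$ and any
$y \in \mathsf{Y}^{\mathbb{Z}}$ such that
$\sup_{x,x'} |h(x, x', y_s)| < +\infty$ for any $s \in \mathbb{Z}$:

(i) For any $r < s \le t$ and any $\ell_1, \ell_2 \ge 1$,

\[ \sup_{\theta \in \Theta} \left| \Phi^{\chi,r}_{\theta,s,t}(h, y) - \Phi^{\chi',r-\ell_1}_{\theta,s,t+\ell_2}(h, y) \right| \le \left( \rho^{s-1-r} + \rho^{t-s} \right) \mathrm{osc}(h_s) . \tag{28} \]

(ii) For any $\theta \in \Theta$, there exists a function
$y \mapsto \Phi_\theta(h, y)$ s.t. for any distribution $\chi$ on
$(\mathsf{X}, \mathcal{B}(\mathsf{X}))$ and any $r < s \le t$

\[ \sup_{\theta \in \Theta} \left| \Phi^{\chi,r}_{\theta,s,t}(h, y) - \Phi_\theta(h, \vartheta^s y) \right| \le \left( \rho^{s-1-r} + \rho^{t-s} \right) \mathrm{osc}(h_s) . \tag{29} \]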



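Remark 4.4. (a) If $\chi = \chi'$, $\ell_1 = 0$ and $\ell_2 \ge 1$, (28) becomes

\[ \sup_{\theta \in \Theta} \left| \Phi^{\chi,r}_{\theta,s,t}(h, y) - \Phi^{\chi,r}_{\theta,s,t+\ell_2}(h, y) \right| \le \rho^{t-s}\, \mathrm{osc}(h_s) . \]

(b) If $\ell_2 = 0$ and $\ell_1 \ge 1$, (28) becomes

\[ \sup_{\theta \in \Theta} \left| \Phi^{\chi,r}_{\theta,s,t}(h, y) - \Phi^{\chi',r-\ell_1}_{\theta,s,t}(h, y) \right| \le \rho^{s-1-r}\, \mathrm{osc}(h_s) . \]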



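Proof. We will use the shorthand $h_s$ for $h_s(x, x') = h(x, x', y_s)$.

(i) Let $r, s, t$ be such that $r < s \le t$, $\ell_1, \ell_2 \ge 1$, and
$\theta \in \Theta$. Define the distribution $\chi_{\theta,r-\ell_1:r}$ on
$(\mathsf{X}, \mathcal{B}(\mathsf{X}))$ by

\[ \chi_{\theta,r-\ell_1:r}(A) = \frac{\int \chi'(dx_{r-\ell_1}) L_{\theta,r-\ell_1:r-1}(x_{r-\ell_1}, dx_r) 1_A(x_r)}{\int \chi'(dx_{r-\ell_1}) L_{\theta,r-\ell_1:r-1}(x_{r-\ell_1}, \mathsf{X})} , \quad \forall A \in \mathcal{B}(\mathsf{X}) . \]

We write
$\left| \Phi^{\chi,r}_{\theta,s,t}(h, y) - \Phi^{\chi',r-\ell_1}_{\theta,s,t+\ell_2}(h, y) \right| \le T_1 + T_2$
where, by using (23),

\[
T_1 = \left| \frac{\int \chi(dx_r) L_{\theta,r:s-2}(x_r, dx_{s-1}) h_s(x_{s-1}, x_s) L_{\theta,s-1}(x_{s-1}, dx_s) L_{\theta,s:t-1}(x_s, \mathsf{X})}{\int \chi(dx_r) L_{\theta,r:t-1}(x_r, \mathsf{X})} - \frac{\int \chi_{\theta,r-\ell_1:r}(dx_r) L_{\theta,r:s-2}(x_r, dx_{s-1}) h_s(x_{s-1}, x_s) L_{\theta,s-1}(x_{s-1}, dx_s) L_{\theta,s:t-1}(x_s, \mathsf{X})}{\int \chi_{\theta,r-\ell_1:r}(dx_r) L_{\theta,r:t-1}(x_r, \mathsf{X})} \right|
\]

and

\[
T_2 = \left| \frac{\int \chi_{\theta,r-\ell_1:r}(dx_r) L_{\theta,r:s-2}(x_r, dx_{s-1}) h_s(x_{s-1}, x_s) L_{\theta,s-1}(x_{s-1}, dx_s) L_{\theta,s:t-1}(x_s, \mathsf{X})}{\int \chi_{\theta,r-\ell_1:r}(dx_r) L_{\theta,r:t-1}(x_r, \mathsf{X})} - \frac{\int \chi_{\theta,r-\ell_1:r}(dx_r) L_{\theta,r:s-2}(x_r, dx_{s-1}) h_s(x_{s-1}, x_s) L_{\theta,s-1}(x_{s-1}, dx_s) L_{\theta,s:t+\ell_2-1}(x_s, \mathsf{X})}{\int \chi_{\theta,r-\ell_1:r}(dx_r) L_{\theta,r:t+\ell_2-1}(x_r, \mathsf{X})} \right| .
\]

Set $h_{s,t} : x \mapsto \int F_{\theta,s-1,t}(x, dx_s) h_s(x, x_s)$ where
$F_{\theta,s-1,t}$ is the forward smoothing kernel (see (24)). Then,

\[
T_1 = \left| \frac{\int \chi(dx_r) L_{\theta,r:s-2}(x_r, dx_{s-1}) h_{s,t}(x_{s-1}) L_{\theta,s-1:t-1}(x_{s-1}, \mathsf{X})}{\int \chi(dx_r) L_{\theta,r:t-1}(x_r, \mathsf{X})} - \frac{\int \chi_{\theta,r-\ell_1:r}(dx_r) L_{\theta,r:s-2}(x_r, dx_{s-1}) h_{s,t}(x_{s-1}) L_{\theta,s-1:t-1}(x_{s-1}, \mathsf{X})}{\int \chi_{\theta,r-\ell_1:r}(dx_r) L_{\theta,r:t-1}(x_r, \mathsf{X})} \right| .
\]

By Lemma 4.2(i),

\[ T_1 \le \rho^{s-1-r}\, \mathrm{osc}(h_{s,t}) \le 2 \rho^{s-1-r} \sup_{x \in \mathsf{X}} |h_{s,t}(x)| \le 2 \rho^{s-1-r} \sup_{(x,x') \in \mathsf{X}^2} |h_s(x, x')| . \]

Set
$\tilde h_s : x \mapsto \int B^{\chi_{\theta,r-\ell_1:s-1},s-1}_{\theta,s-1}(x, dx_{s-1}) h_s(x_{s-1}, x)$,
where $B^{\chi_{\theta,r-\ell_1:s-1},s-1}_{\theta,s-1}$ is the backward
smoothing kernel (see (25)). Then

\[
T_2 = \left| \frac{\int \chi_{\theta,r-\ell_1:s}(dx_s) \tilde h_s(x_s) L_{\theta,s:t-1}(x_s, dx_t) L_{\theta,t:t+\ell_2-1}(x_t, \mathsf{X})}{\int \chi_{\theta,r-\ell_1:s}(dx_s) L_{\theta,s:t-1}(x_s, dx_t) L_{\theta,t:t+\ell_2-1}(x_t, \mathsf{X})} - \frac{\int \chi_{\theta,r-\ell_1:s}(dx_s) \tilde h_s(x_s) L_{\theta,s:t-1}(x_s, \mathsf{X})}{\int \chi_{\theta,r-\ell_1:s}(dx_s) L_{\theta,s:t-1}(x_s, \mathsf{X})} \right| .
\]

Then, by Lemma 4.2(ii),

\[ T_2 \le \rho^{t-s}\, \mathrm{osc}(\tilde h_s) \le 2 \rho^{t-s} \sup_{x \in \mathsf{X}} |\tilde h_s(x)| \le 2 \rho^{t-s} \sup_{(x,x') \in \mathsf{X}^2} |h_s(x, x')| . \]

The proof is concluded upon noting that the left-hand side of (28) is unchanged
when $h$ is replaced by $h - c$ for any constant $c$, and that

\[ \mathrm{osc}(h_s) = 2 \inf_{c \in \mathbb{R}} \left\{ \sup_{(x,x') \in \mathsf{X}^2} |h_s(x, x') - c| \right\} . \]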



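(ii) By (28), for any increasing sequences of non-negative integers
$(r_\ell)_{\ell \ge 0}$, $(t_\ell)_{\ell \ge 0}$ s.t.
$\lim r_\ell = \lim t_\ell = +\infty$, the sequence
$\{\Phi^{\chi,-r_\ell}_{\theta,0,t_\ell}(h, y)\}_{\ell \ge 0}$ is a Cauchy
sequence uniformly in $\theta$ and $\chi$. Then, there exists a limit
$\Phi_\theta(h, y)$ s.t.

\[ \lim_{\ell \to +\infty} \sup_{\chi} \sup_{\theta \in \Theta} \left| \Phi^{\chi,-r_\ell}_{\theta,0,t_\ell}(h, y) - \Phi_\theta(h, y) \right| = 0 . \tag{30} \]

We write, for any $r < s \le t$ and any $\ell \ge 1$,

\[ \left| \Phi^{\chi,r}_{\theta,s,t}(h, y) - \Phi_\theta(h, \vartheta^s y) \right| \le \left| \Phi^{\chi,r}_{\theta,s,t}(h, y) - \Phi^{\chi,r-\ell}_{\theta,s,t+\ell}(h, y) \right| + \left| \Phi^{\chi,r-\ell}_{\theta,s,t+\ell}(h, y) - \Phi_\theta(h, \vartheta^s y) \right| . \]

Since
$\Phi^{\chi,r-\ell}_{\theta,s,t+\ell}(h, y) = \Phi^{\chi,r-\ell-s}_{\theta,0,t+\ell-s}(h, \vartheta^s y)$,
Proposition 4.3(i) yields

\[ \left| \Phi^{\chi,r}_{\theta,s,t}(h, y) - \Phi_\theta(h, \vartheta^s y) \right| \le \left( \rho^{s-1-r} + \rho^{t-s} \right) \mathrm{osc}(h_s) + \left| \Phi^{\chi,r-\ell-s}_{\theta,0,t+\ell-s}(h, \vartheta^s y) - \Phi_\theta(h, \vartheta^s y) \right| . \]

The proof is concluded by (30). □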



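Lemma 4.5 is a consequence, respectively, of (9) and of Proposition 4.3(ii).

Lemma 4.5. Assume H2. Let $r < s \le t$ be integers, $\theta \in \Theta$ and
$y \in \mathsf{Y}^{\mathbb{Z}}$, and
$h : \mathsf{X}^2 \times \mathsf{Y} \to \mathbb{R}^d$ s.t. for any
$s \in \mathbb{Z}$, $\sup_{x,x'} |h(x, x', y_s)| < \infty$. Then

\[ \left| \Phi^{\chi,r}_{\theta,s,t}(h, y) \right| \le \sup_{(x,x') \in \mathsf{X}^2} |h(x, x', y_s)| , \qquad \left| \Phi_\theta(h, \vartheta^s y) \right| \le \sup_{(x,x') \in \mathsf{X}^2} |h(x, x', y_s)| . \]
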
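4.3 Limiting normalized log-likelihood

Define for any $r \le s$,

\[ \delta^{\chi,r}_{\theta,s}(y) = \ell^{\chi,r}_{\theta,s+1}(y) - \ell^{\chi,r}_{\theta,s}(y) , \tag{31} \]

where $\ell^{\chi,r}_{\theta,s+1}(y)$ is defined by

\[ \ell^{\chi,r}_{\theta,s+1}(y) = \log \int \chi(dx_r) \prod_{u=r+1}^{s+1} m_\theta(x_{u-1}, x_u) g_\theta(x_u, y_u)\, \lambda(dx_{r+1}) \cdots \lambda(dx_{s+1}) . \tag{32} \]

For any $T > 0$ and any probability distribution $\chi$ on
$(\mathsf{X}, \mathcal{B}(\mathsf{X}))$, we thus have

\[ \ell^{\chi,0}_{\theta,T}(y) = \sum_{s=0}^{T-1} \left( \ell^{\chi,0}_{\theta,s+1}(y) - \ell^{\chi,0}_{\theta,s}(y) \right) = \sum_{s=0}^{T-1} \delta^{\chi,0}_{\theta,s}(y) . \tag{33} \]

It is established in Lemma 4.6 that for any $\theta \in \Theta$,
$y \in \mathsf{Y}^{\mathbb{Z}}$, $s \ge 0$ and any initial distribution $\chi$,
the sequence $\{\delta^{\chi,s-r}_{\theta,s}(y)\}_{r \ge 0}$ is a Cauchy sequence
and its limit does not depend upon $\chi$. Regularity conditions on this limit
are given in Lemmas 4.7 and 4.8. Finally, Theorem 4.9 shows that for any
$\theta$, $\lim_T T^{-1} \ell^{\chi,0}_{\theta,T}(Y)$ exists w.p.1 and this
limit is a (deterministic) continuous function in $\theta$.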
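
The telescoping identity (33) is what makes the log-likelihood computable by a forward recursion. As a minimal sketch, assuming a finite state space with counting measure and a finite observation alphabet (the function names below are hypothetical), the increments $\delta^{\chi,0}_{\theta,s}(y)$ can be accumulated with a normalized filter and compared with a brute-force evaluation of (32) over all hidden paths.

```python
import itertools

import numpy as np


def loglik_increments(chi, m, g, y):
    """Increments delta_s = l_{s+1} - l_s for s = 0..T-1, with X_0 ~ chi.
    y[s] plays the role of y_{s+1}."""
    alpha = chi.copy()                      # filtering distribution at time s
    increments = []
    for obs in y:
        unnorm = (alpha @ m) * g[:, obs]    # one-step prediction, then update
        c = unnorm.sum()                    # exp(delta_s)
        increments.append(np.log(c))
        alpha = unnorm / c
    return np.array(increments)


def loglik_bruteforce(chi, m, g, y):
    """Direct evaluation of (32) by summing over every hidden path."""
    K, T = len(chi), len(y)
    total = 0.0
    for path in itertools.product(range(K), repeat=T + 1):
        p = chi[path[0]]
        for u in range(1, T + 1):
            p *= m[path[u - 1], path[u]] * g[path[u], y[u - 1]]
        total += p
    return np.log(total)


rng = np.random.default_rng(3)
K, n_obs, T = 3, 2, 5
m = rng.dirichlet(np.ones(K), size=K)       # transition matrix
g = rng.dirichlet(np.ones(n_obs), size=K)   # emission probabilities g[x, y]
chi = rng.dirichlet(np.ones(K))
y = rng.integers(0, n_obs, size=T)

increments = loglik_increments(chi, m, g, y)
brute = loglik_bruteforce(chi, m, g, y)
```

The brute-force sum has $K^{T+1}$ terms and is only usable for tiny examples; the recursion costs $O(TK^2)$ and its increments are the quantities whose averages are studied in Theorem 4.9.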
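Lemma 4.6. Assume H2.

(i) For any $\ell, r, s \ge 0$, any initial distributions $\chi, \chi'$ on
$\mathsf{X}$ and any $y \in \mathsf{Y}^{\mathbb{Z}}$,

\[ \sup_{\theta \in \Theta} \left| \delta^{\chi,s-r}_{\theta,s}(y) - \delta^{\chi',s-r-\ell}_{\theta,s}(y) \right| \le \frac{2}{1 - \rho}\, \rho^{r} . \]

(ii) For any $\theta \in \Theta$, there exists a function
$y \mapsto \delta_\theta(y)$ such that for any initial distribution $\chi$, any
$y \in \mathsf{Y}^{\mathbb{Z}}$ and any $r, s \ge 0$,

\[ \sup_{\theta \in \Theta} \left| \delta^{\chi,s-r}_{\theta,s}(y) - \delta_\theta(\vartheta^s \circ y) \right| \le \frac{2}{1 - \rho}\, \rho^{r} . \]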

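Proof. Proof of (i). Let $s \ge 0$ and $r$ and $r'$ be such that $r' > r$. By
(31) and (32), we have
$|\delta^{\chi,s-r}_{\theta,s}(y) - \delta^{\chi',s-r'}_{\theta,s}(y)| = |\log \alpha - \log \beta|$
where

\[
\alpha = \frac{\int \chi(dx_{s-r}) \prod_{i=s-r+1}^{s+1} m_\theta(x_{i-1}, x_i) g_\theta(x_i, y_i)\, \lambda(dx_i)}{\int \chi(dx_{s-r}) \prod_{i=s-r+1}^{s} m_\theta(x_{i-1}, x_i) g_\theta(x_i, y_i)\, \lambda(dx_i)} , \qquad
\beta = \frac{\int \chi'(dx_{s-r'}) \prod_{i=s-r'+1}^{s+1} m_\theta(x_{i-1}, x_i) g_\theta(x_i, y_i)\, \lambda(dx_i)}{\int \chi'(dx_{s-r'}) \prod_{i=s-r'+1}^{s} m_\theta(x_{i-1}, x_i) g_\theta(x_i, y_i)\, \lambda(dx_i)} . \tag{34}
\]

We prove that

\[ \alpha \wedge \beta \ge \sigma_- \int g_\theta(x_{s+1}, y_{s+1})\, \lambda(dx_{s+1}) , \tag{35} \]
\[ |\alpha - \beta| \le 2 \rho^{r} \sigma_+ \int g_\theta(x_{s+1}, y_{s+1})\, \lambda(dx_{s+1}) , \tag{36} \]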
(36) 
and the proof is concluded since $|\log \alpha - \log \beta| \le |\alpha - \beta| / (\alpha \wedge \beta)$.
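
The elementary inequality used here follows from $\log(1 + u) \le u$ applied to $u = |\alpha - \beta| / (\alpha \wedge \beta)$. A quick numerical sanity check, given purely as an illustration on arbitrary positive numbers (the sampling range is an assumption of the sketch):

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary positive pairs (a, b) standing in for (alpha, beta).
a = rng.uniform(1e-3, 10.0, size=10_000)
b = rng.uniform(1e-3, 10.0, size=10_000)

lhs = np.abs(np.log(a) - np.log(b))          # |log a - log b|
rhs = np.abs(a - b) / np.minimum(a, b)       # |a - b| / min(a, b)

worst_slack = float((rhs - lhs).min())       # non-negative up to rounding
```

Combined with (35) and (36), the inequality gives $|\log \alpha - \log \beta| \le 2 \rho^r \sigma_+ / \sigma_- = 2 \rho^r / (1 - \rho)$, which is the bound stated in Lemma 4.6(i).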

The minorization on $\alpha$ and $\beta$ is a consequence of H2 upon noting that
$\alpha$ and $\beta$ are of the form
$\int \mu(dx_s) m_\theta(x_s, x_{s+1}) g_\theta(x_{s+1}, y_{s+1})\, \lambda(dx_{s+1})$
for some probability measure $\mu$. The upper bound on $|\alpha - \beta|$ is a
consequence of Lemma 4.2(i) applied with

\[ \chi'(dx_{s-r}) \leftarrow \int \chi'(dx_{s-r'}) \left\{ \prod_{i=s-r'}^{s-r-1} g_\theta(x_{i+1}, y_{i+1}) m_\theta(x_i, x_{i+1}) \right\} \lambda(dx_{s-r'+1:s-r}) \]

and $h(u) \leftarrow \int g_\theta(x_{s+1}, y_{s+1}) m_\theta(u, x_{s+1})\, \lambda(dx_{s+1})$.

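Proof of (ii). By (i), for any $y \in \mathsf{Y}^{\mathbb{Z}}$, the sequence
$\{\delta^{\chi,-r}_{\theta,0}(y)\}_{r \ge 0}$ is a Cauchy sequence uniformly in
$\theta$: there exists a limit denoted by $\delta_\theta(y)$, which does not
depend upon $\chi$, such that

\[ \lim_{r \to +\infty} \sup_{\theta \in \Theta} \left| \delta^{\chi,-r}_{\theta,0}(y) - \delta_\theta(y) \right| = 0 . \tag{37} \]

We write for $r \le r'$

\[ \left| \delta^{\chi,s-r}_{\theta,s}(y) - \delta_\theta(\vartheta^s \circ y) \right| \le \left| \delta^{\chi,s-r}_{\theta,s}(y) - \delta^{\chi,s-r'}_{\theta,s}(y) \right| + \left| \delta^{\chi,s-r'}_{\theta,s}(y) - \delta_\theta(\vartheta^s \circ y) \right| . \]

Observe that by definition,
$\delta^{\chi,s-r}_{\theta,s}(y) = \delta^{\chi,-r}_{\theta,0}(\vartheta^s \circ y)$.
This property, combined with Lemma 4.6(i), yields

\[ \sup_{\theta \in \Theta} \left| \delta^{\chi,s-r}_{\theta,s}(y) - \delta_\theta(\vartheta^s \circ y) \right| \le \frac{2}{1 - \rho}\, \rho^{r} + \sup_{\theta \in \Theta} \left| \delta^{\chi,-r'}_{\theta,0}(\vartheta^s \circ y) - \delta_\theta(\vartheta^s \circ y) \right| . \]

When $r' \to +\infty$, the second term in the rhs tends to zero by (37), for
fixed $y$, $s$ and $\chi$. This concludes the proof. □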

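Lemma 4.7. Assume H2. For any $y \in \mathsf{Y}^{\mathbb{Z}}$ and $s \ge 0$,

\[ \sup_{r \ge 0} \sup_{\theta \in \Theta} \left| \delta^{\chi,s-r}_{\theta,s}(y) \right| \le |\log \sigma_+ b_+(y_{s+1})| + |\log \sigma_- b_-(y_{s+1})| , \]

and, for any $r \ge 0$,

\[ \sup_{\theta \in \Theta} |\delta_\theta(y)| \le \frac{2 \rho^{r}}{1 - \rho} + |\log \sigma_+ b_+(y_1)| + |\log \sigma_- b_-(y_1)| , \]

where $b_+$ and $b_-$ are defined by (1) and (2).

Proof. For any $0 < m \le A/B \le M$, $|\log(A/B)| \le |\log M| + |\log m|$. Note
that by definition, $\delta^{\chi,s-r}_{\theta,s}(y)$ is of the form
$\log(A/B)$ and under H2,
$\sigma_- b_-(y_{s+1}) \le A/B \le \sigma_+ b_+(y_{s+1})$. The second upper
bound is a consequence of Lemma 4.6(ii). □
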
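Lemma 4.8. Assume H1(a), H2, H4(a) and H4(c). Then,
$\theta \mapsto \mathbb{E}_\star[\delta_\theta(Y)]$ is continuous on $\Theta$.
If in addition $\Theta$ is compact,

\[ \lim_{\eta \to 0} \mathbb{E}_\star \left[ \sup_{\{\theta, \theta' \in \Theta ; |\theta - \theta'| < \eta\}} |\delta_\theta(Y) - \delta_{\theta'}(Y)| \right] = 0 . \tag{38} \]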



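Proof. By the dominated convergence theorem, Lemma 4.7 and H4(c),
$\theta \mapsto \mathbb{E}_\star[\delta_\theta(Y)]$ is continuous if
$\theta \mapsto \delta_\theta(y)$ is continuous for any
$y \in \mathsf{Y}^{\mathbb{Z}}$. Let $y \in \mathsf{Y}^{\mathbb{Z}}$. By
Lemma 4.6(ii),
$\lim_{r \to +\infty} \sup_{\theta \in \Theta} |\delta^{\chi,-r}_{\theta,0}(y) - \delta_\theta(y)| = 0$.
Therefore, $\theta \mapsto \delta_\theta(y)$ is continuous provided for any
$r \ge 0$, $\theta \mapsto \delta^{\chi,-r}_{\theta,0}(y)$ is continuous (for
fixed $y$ and $\chi$). By definition of $\delta^{\chi,-r}_{\theta,0}(y)$, see
(31), it is sufficient to prove that
$\theta \mapsto \ell^{\chi,-r}_{\theta,s}(y)$ is continuous for
$s \in \{0, 1\}$. By definition of $\ell^{\chi,-r}_{\theta,s}(y)$, see (32),

\[ \ell^{\chi,-r}_{\theta,s}(y) = \log \int \chi(dx_{-r}) \prod_{i=-r+1}^{s} m_\theta(x_{i-1}, x_i) g_\theta(x_i, y_i)\, \lambda(dx_i) . \]

Under H1(a),
$\theta \mapsto \prod_{i=-r+1}^{s} m_\theta(x_{i-1}, x_i) g_\theta(x_i, y_i)$ is
continuous on $\Theta$, for any $x_{-r:s}$ and $y$. In addition, under H1, for
any $\theta \in \Theta$,

\[ \prod_{i=-r+1}^{s} m_\theta(x_{i-1}, x_i) g_\theta(x_i, y_i) = \exp \left( (s + r) \phi(\theta) + \left\langle \psi(\theta), \sum_{i=-r+1}^{s} S(x_{i-1}, x_i, y_i) \right\rangle \right) . \]

Let $\mathcal{K}$ be a compact subset of $\Theta$. Since by H1, $\phi$ and
$\psi$ are continuous, there exist constants $C_1$ and $C_2$ such that,

\[ \sup_{\theta \in \mathcal{K}} \prod_{i=-r+1}^{s} m_\theta(x_{i-1}, x_i) g_\theta(x_i, y_i) \le C_1 \exp \left( C_2 \sum_{i=-r+1}^{s} \sup_{x,x'} |S(x, x', y_i)| \right) . \]

Since the measure $\chi(dx_{-r}) \prod_{i=-r+1}^{s} \lambda(dx_i)$ is finite,
the dominated convergence theorem now implies that
$\ell^{\chi,-r}_{\theta,s}(y)$ is continuous on $\Theta$.

For the proof of (38), let us apply the dominated convergence theorem again.
Since $\Theta$ is compact, for any $y \in \mathsf{Y}^{\mathbb{Z}}$,
$\theta \mapsto \delta_\theta(y)$ is uniformly continuous and
$\lim_{\eta \to 0} \sup_{|\theta - \theta'| < \eta} |\delta_\theta(y) - \delta_{\theta'}(y)| = 0$.
In addition, we have by Lemma 4.7

\[ \sup_{\{\theta, \theta' \in \Theta ; |\theta - \theta'| < \eta\}} |\delta_\theta(y) - \delta_{\theta'}(y)| \le 2 \sup_{\theta \in \Theta} |\delta_\theta(y)| \le \frac{4}{1 - \rho} + 2 \left\{ |\log \sigma_+ b_+(y_1)| + |\log \sigma_- b_-(y_1)| \right\} . \]

Under H4(a) and H4(c), this upper bound is $\mathbb{P}_\star$-integrable. This
concludes the proof. □

Theorem 4.9. Assume H1(a), H2 and H4. Define the function
$c_\star : \Theta \to \mathbb{R}$ by
$c_\star(\theta) = \mathbb{E}_\star[\delta_\theta(Y)]$, where
$\delta_\theta(y)$ is defined in Lemma 4.6.

(i) The function $\theta \mapsto c_\star(\theta)$ is continuous on $\Theta$.

(ii) For any initial distribution $\chi$ on
$(\mathsf{X}, \mathcal{B}(\mathsf{X}))$,

\[ \frac{1}{T}\, \ell^{\chi,0}_{\theta,T}(Y) \to c_\star(\theta) \quad \mathbb{P}_\star\text{-a.s.} \quad (T \to +\infty) , \tag{39} \]

where $\ell^{\chi,0}_{\theta,T}(Y)$ is defined in (32).

(iii) If in addition $\Theta$ is compact, for any initial distribution $\chi$
on $(\mathsf{X}, \mathcal{B}(\mathsf{X}))$,

\[ \sup_{\theta \in \Theta} \left| \frac{1}{T}\, \ell^{\chi,0}_{\theta,T}(Y) - c_\star(\theta) \right| \to 0 \quad \mathbb{P}_\star\text{-a.s.} \quad (T \to +\infty) . \tag{40} \]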

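Proof. (i) is proved in Lemma 4.8.

(ii) By (33), for any $T > 0$, we have, for any
$y \in \mathsf{Y}^{\mathbb{Z}}$,

\[ \frac{1}{T}\, \ell^{\chi,0}_{\theta,T}(y) = \frac{1}{T} \sum_{s=0}^{T-1} \delta^{\chi,0}_{\theta,s}(y) = \frac{1}{T} \sum_{s=0}^{T-1} \left( \delta^{\chi,0}_{\theta,s}(y) - \delta_\theta(\vartheta^s \circ y) \right) + \frac{1}{T} \sum_{s=0}^{T-1} \delta_\theta(\vartheta^s \circ y) . \]

By Lemma 4.6(ii), for any $0 \le s \le T - 1$,
$|\delta^{\chi,0}_{\theta,s}(Y) - \delta_\theta(\vartheta^s \circ Y)| \le 2 \rho^{s} / (1 - \rho)$.
Since $\rho \in (0, 1)$,

\[ \lim_{T \to +\infty} \frac{1}{T} \sum_{s=0}^{T-1} \left| \delta^{\chi,0}_{\theta,s}(Y) - \delta_\theta(\vartheta^s \circ Y) \right| = 0 \quad \mathbb{P}_\star\text{-a.s.} \]

By Lemma 4.7,

\[ \mathbb{E}_\star \left[ |\delta_\theta(Y)| \right] \le \frac{2}{1 - \rho} + \mathbb{E}_\star \left[ |\log \sigma_+ b_+(Y_1)| + |\log \sigma_- b_-(Y_1)| \right] , \]

and the rhs is finite under assumption H4(c). By H4(a-b), the ergodic theorem,
see [1, Theorem 24.1, p. 314], concludes the proof.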

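(iii) Since $\Theta$ is compact, (40) holds if for any $\varepsilon > 0$, any
$\theta' \in \Theta$, there exists $\eta > 0$ such that

\[ \limsup_{T \to +\infty} \sup_{\{\theta ; |\theta - \theta'| < \eta\} \cap \Theta} \left| T^{-1} \ell^{\chi,0}_{\theta,T}(Y) - T^{-1} \ell^{\chi,0}_{\theta',T}(Y) \right| \le \varepsilon \quad \mathbb{P}_\star\text{-a.s.} \tag{41} \]

Let $\varepsilon > 0$ and $\theta' \in \Theta$. Choose $\eta > 0$ such that

\[ \mathbb{E}_\star \left[ \sup_{\{\theta \in \Theta ; |\theta - \theta'| < \eta\}} |\delta_\theta(Y) - \delta_{\theta'}(Y)| \right] \le \varepsilon ; \tag{42} \]

such an $\eta$ exists by Lemma 4.8. By (33), we have, for any
$\theta \in \Theta$ such that $|\theta - \theta'| < \eta$,

\[ \left| \ell^{\chi,0}_{\theta,T}(Y) - \ell^{\chi,0}_{\theta',T}(Y) \right| \le \sum_{s=0}^{T-1} \left| \delta^{\chi,0}_{\theta,s}(Y) - \delta^{\chi,0}_{\theta',s}(Y) \right| . \tag{43} \]

In addition, by Lemma 4.6(ii),

\[
\begin{aligned}
\sum_{s=0}^{T-1} \left| \delta^{\chi,0}_{\theta,s}(Y) - \delta^{\chi,0}_{\theta',s}(Y) \right|
& \le 2 \sum_{s=0}^{T-1} \sup_{\theta \in \Theta} \left| \delta^{\chi,0}_{\theta,s}(Y) - \delta_\theta(\vartheta^s \circ Y) \right| + \sum_{s=0}^{T-1} \left| \delta_\theta(\vartheta^s \circ Y) - \delta_{\theta'}(\vartheta^s \circ Y) \right| \\
& \le \frac{4}{(1 - \rho)^2} + \sum_{s=0}^{T-1} \Xi(\vartheta^s \circ Y) ,
\end{aligned}
\]

where
$\Xi(y) = \sup_{\{\theta \in \Theta ; |\theta - \theta'| < \eta\}} |\delta_\theta(y) - \delta_{\theta'}(y)|$.
This implies that

\[ \limsup_{T \to +\infty} \sup_{\{\theta \in \Theta ; |\theta - \theta'| < \eta\}} \frac{1}{T} \sum_{s=0}^{T-1} \left| \delta^{\chi,0}_{\theta,s}(Y) - \delta^{\chi,0}_{\theta',s}(Y) \right| \le \lim_{T \to +\infty} \frac{1}{T} \sum_{s=0}^{T-1} \Xi(\vartheta^s \circ Y) . \]

Under H4(a-b), the ergodic theorem implies that the rhs converges
$\mathbb{P}_\star$-a.s. to $\mathbb{E}_\star[\Xi(Y)]$, see [1, p. 314]. Then,
using again (42),

\[ \limsup_{T \to +\infty} \sup_{\{\theta \in \Theta ; |\theta - \theta'| < \eta\}} \frac{1}{T} \sum_{s=0}^{T-1} \left| \delta^{\chi,0}_{\theta,s}(Y) - \delta^{\chi,0}_{\theta',s}(Y) \right| \le \varepsilon \quad \mathbb{P}_\star\text{-a.s.} \]

Then, (41) holds and this concludes the proof. □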



4.4 Limit of the normalized score 

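This section is devoted to the proof of the $\mathbb{P}_\star$-a.s. convergence
of the normalized score $T^{-1} \nabla_\theta \ell^{\chi,0}_{\theta,T}(Y)$ to
$\nabla_\theta c_\star(\theta)$. This result is established under additional
assumptions on the model.

S1 (a) For any $y \in \mathsf{Y}$ and for all $(x, x') \in \mathsf{X}^2$,
$\theta \mapsto g_\theta(x, y)$ and $\theta \mapsto m_\theta(x, x')$ are
continuously differentiable on $\Theta$.

(b) We assume that $\mathbb{E}_\star[\phi(Y_0)] < +\infty$ where

\[ \phi(y) = \sup_{\theta \in \Theta} \sup_{(x,x') \in \mathsf{X}^2} \left| \nabla_\theta \log m_\theta(x, x') + \nabla_\theta \log g_\theta(x', y) \right| . \tag{44} \]
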
Lemma 4.10. Assume S1. For any initial distribution $\chi$, any integers
$s, r \ge 0$ and any $y \in \mathsf{Y}^{\mathbb{Z}}$ such that
$\phi(y_u) < +\infty$ for any $u \in \mathbb{Z}$, the function
$\theta \mapsto \ell^{\chi,s-r}_{\theta,s}(y)$ is continuously differentiable on
$\Theta$ and

\[ \nabla_\theta \ell^{\chi,s-r}_{\theta,s}(y) = \sum_{u=s-r+1}^{s} \Phi^{\chi,s-r}_{\theta,u,s}(\Upsilon_\theta, y) , \]

where $\Upsilon_\theta$ is the function defined on
$\mathsf{X}^2 \times \mathsf{Y}$ by

\[ \Upsilon_\theta : (x, x', y) \mapsto \nabla_\theta \log \{ m_\theta(x, x') g_\theta(x', y) \} . \]

Proof. Under S1, the dominated convergence theorem implies that the function
$\theta \mapsto \ell^{\chi,s-r}_{\theta,s}(y)$ is continuously differentiable
and its derivative is obtained by permutation of the gradient and integral
operators. □

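Lemma 4.11. Assume H2 and S1.

(i) There exists a function $\xi : \mathsf{Y}^{\mathbb{Z}} \to \mathbb{R}_+$
such that for any $s \ge 0$ and any $r, r' \ge s$, any initial distributions
$\chi, \chi'$ on $\mathsf{X}$ and any $y \in \mathsf{Y}^{\mathbb{Z}}$ such that
$\phi(y_u) < +\infty$ for any $u \in \mathbb{Z}$,

\[ \sup_{\theta \in \Theta} \left| \nabla_\theta \delta^{\chi,s-r}_{\theta,s}(y) - \nabla_\theta \delta^{\chi',s-r'}_{\theta,s}(y) \right| \le \frac{16 \rho^{-1/4}}{1 - \rho}\, \rho^{(r' \wedge r)/4}\, \xi(y) , \tag{45} \]

where

\[ \xi(y) = \sum_{u \in \mathbb{Z}} \phi(y_u)\, \rho^{|u|/4} . \]

(ii) For any $y \in \mathsf{Y}^{\mathbb{Z}}$ satisfying $\xi(y) < +\infty$, the
function $\theta \mapsto \delta_\theta(y)$ given by Lemma 4.6(ii) is
continuously differentiable on $\Theta$ and, for any $\theta \in \Theta$, any
initial distribution $\chi$ and any integers $r \ge s \ge 0$,

\[ \sup_{\theta \in \Theta} \left| \nabla_\theta \delta^{\chi,s-r}_{\theta,s}(y) - \nabla_\theta \delta_\theta(\vartheta^s \circ y) \right| \le \frac{16 \rho^{-1/4}}{1 - \rho}\, \rho^{r/4}\, \xi(y) . \]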

4.10 



Proof. ^ By definition of <5g s s r (y), see (31 1 and Lemma 

- E (^;r s ^ 1 (T,,y)-<f^rr 1 (^,y)) 

u=5— r 

- E (<Ci" 1 ( T «.y)-<i7 r '" 1 ( T «.y)) 

u=s—r' 

+ ^:;;[;+i(T„y)-$&^(T„y). 
We can assume without loss of generality that $r' \le r$ so that

v<r r (y)-v e <r r '(y) 

u— s— r 

+ E Hc+^( T *>y) - 'K:.: : >^,y ; - ^s^w) + *&r'- 1 (T S ,y)} 



u— s — r' 



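Under the assumptions of the lemma, Remark 4.4 can be applied and for any $s - r \le u \le s - r' - 1$,

$$\left| \Phi_{\theta,u,s+1}^{\chi,s-r-1}(\Upsilon_\theta, y) - \Phi_{\theta,u,s}^{\chi,s-r-1}(\Upsilon_\theta, y) \right| \le 2 \rho^{s-u} \phi(y_u),$$

where $\phi(y_u)$ is defined in (44). Similarly, by Remark 4.4,

$$\left| \Phi_{\theta,s+1,s+1}^{\chi,s-r-1}(\Upsilon_\theta, y) - \Phi_{\theta,s+1,s+1}^{\chi',s-r'-1}(\Upsilon_\theta, y) \right| \le 2 \rho^{r'+1} \phi(y_{s+1}).$$

For any $s - r' \le u \le s$, by Remark 4.4,

$$\left| \Phi_{\theta,u,s+1}^{\chi,s-r-1}(\Upsilon_\theta, y) - \Phi_{\theta,u,s+1}^{\chi',s-r'-1}(\Upsilon_\theta, y) - \Phi_{\theta,u,s}^{\chi,s-r-1}(\Upsilon_\theta, y) + \Phi_{\theta,u,s}^{\chi',s-r'-1}(\Upsilon_\theta, y) \right| \le \left| \Phi_{\theta,u,s+1}^{\chi,s-r-1}(\Upsilon_\theta, y) - \Phi_{\theta,u,s+1}^{\chi',s-r'-1}(\Upsilon_\theta, y) \right| + \left| \Phi_{\theta,u,s}^{\chi,s-r-1}(\Upsilon_\theta, y) - \Phi_{\theta,u,s}^{\chi',s-r'-1}(\Upsilon_\theta, y) \right| \le 4 \rho^{u+r'-s} \phi(y_u),$$

and, again by Remark 4.4, the same left-hand side is also bounded by

$$\left| \Phi_{\theta,u,s+1}^{\chi,s-r-1}(\Upsilon_\theta, y) - \Phi_{\theta,u,s}^{\chi,s-r-1}(\Upsilon_\theta, y) \right| + \left| \Phi_{\theta,u,s+1}^{\chi',s-r'-1}(\Upsilon_\theta, y) - \Phi_{\theta,u,s}^{\chi',s-r'-1}(\Upsilon_\theta, y) \right| \le 4 \rho^{s-u} \phi(y_u).$$

Hence,

$$\left| \nabla_\theta \delta_{\theta,s}^{\chi,s-r}(y) - \nabla_\theta \delta_{\theta,s}^{\chi',s-r'}(y) \right| \le 2 \sum_{u=s-r}^{s-r'-1} \rho^{s-u} \phi(y_u) + 4 \sum_{u=s-r'}^{s+1} \left( \rho^{s-u} \wedge \rho^{u+r'-s} \right) \phi(y_u).$$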
Furthermore,

s+1 



E #y«) 



< E 

s-r'<u<[s-r' /2J 

</ /2 E^> H/4 ••• 



p s ~Xy«) + E ^ +r ' s 



<^(y«) 



u>[s-r'/2j 



< 2 



£ ^-«-r'/2- H /4 + £ 

u<[s-r'/2j L S - J "7 2 J+ 1 < U < S + 1 
,(r'-l)/4 

t— E^> H/4 > 



u+r'/2-s-|«|/4 



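where we used that $\sup_{s-r' \le u \le \lfloor s-r'/2 \rfloor} |u| \le r'$ and $\sup_{\lfloor s-r'/2 \rfloor + 1 \le u \le s+1} |u| \le r' + 1$. Moreover, upon noting that $-u/2 + (s+1)/2 \le s - u - r'/2$ when $u \le s - r' - 1$,

$$\sum_{u=s-r}^{s-r'-1} \phi(y_u) \rho^{s-u} \le \rho^{r'/2} \sum_{u=s-r}^{s-r'-1} \phi(y_u) \rho^{s-u-r'/2} \le \rho^{r'/2} \sum_{u=s-r}^{s-r'-1} \phi(y_u) \rho^{-u/2+(s+1)/2} \le \rho^{r'/2} \rho^{(s+1)/2} \sum_{u=s-r}^{s-r'-1} \phi(y_u) \rho^{|u|/2},$$

where we used that $s - r' - 1 < 0$ in the last inequality.
Hence,

$$\sup_{\theta \in \Theta} \left| \nabla_\theta \delta_{\theta,s}^{\chi,s-r}(y) - \nabla_\theta \delta_{\theta,s}^{\chi',s-r'}(y) \right| \le \frac{16}{1-\rho}\, \rho^{(r'-1)/4} \sum_{u \in \mathbb{Z}} \phi(y_u) \rho^{|u|/4}. \qquad (46)$$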
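The exponent comparison used in the last chain of inequalities can be checked directly. The short computation below is only a verification sketch; it assumes $0 < \rho < 1$ (so that $a \mapsto \rho^a$ is non-increasing) and $r' \ge s$.

```latex
\[
\Bigl(s-u-\tfrac{r'}{2}\Bigr)-\Bigl(-\tfrac{u}{2}+\tfrac{s+1}{2}\Bigr)
  = \tfrac{1}{2}\,(s-u-r'-1) \;\ge\; 0
  \quad\Longleftrightarrow\quad u \le s-r'-1 ,
\]
% Since 0 < rho < 1, a larger exponent gives a smaller power:
\[
\rho^{\,s-u-r'/2} \;\le\; \rho^{\,-u/2+(s+1)/2}
  \qquad\text{for all } u \le s-r'-1 .
\]
% When r' >= s, every such u satisfies u <= s-r'-1 < 0, hence -u/2 = |u|/2 and
\[
\rho^{\,-u/2+(s+1)/2} \;=\; \rho^{(s+1)/2}\,\rho^{|u|/2}.
\]
```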



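(ii) Let $y \in \mathsf{Y}^{\mathbb{Z}}$ be such that $\xi(y) < +\infty$. Then for any $u \in \mathbb{Z}$, $\phi(y_u) < +\infty$. By Lemma 4.10 and Eq. (31), the functions $\{\theta \mapsto \delta_{\theta,0}^{\chi,-r}(y)\}_{r \ge 0}$ are $C^1$ functions on $\Theta$. By (i), there exists a function $\theta \mapsto \tilde\delta_\theta(y)$ such that

$$\lim_{r \to +\infty} \sup_{\theta \in \Theta} \left| \nabla_\theta \delta_{\theta,0}^{\chi,-r}(y) - \tilde\delta_\theta(y) \right| = 0.$$

Furthermore, by Lemma 4.6,

$$\lim_{r \to +\infty} \sup_{\theta \in \Theta} \left| \delta_{\theta,0}^{\chi,-r}(y) - \delta_\theta(y) \right| = 0.$$

Then, $\theta \mapsto \delta_\theta(y)$ is $C^1$ on $\Theta$ and for any $\theta \in \Theta$, $\tilde\delta_\theta(y) = \nabla_\theta \delta_\theta(y)$.

We thus proved that for any $y \in \mathsf{Y}^{\mathbb{Z}}$ such that $\xi(y) < +\infty$ and for any initial distribution $\chi$,

$$\lim_{r \to +\infty} \sup_{\theta \in \Theta} \left| \nabla_\theta \delta_{\theta,0}^{\chi,-r}(y) - \nabla_\theta \delta_\theta(y) \right| = 0. \qquad (47)$$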



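Observe that by definition, $\nabla_\theta \delta_{\theta,s}^{\chi,s-r}(y) = \nabla_\theta \delta_{\theta,0}^{\chi,-r}(\vartheta^s \circ y)$. This property combined with Lemma 4.11(i) yields, for any $r' \ge r$,

$$\sup_{\theta \in \Theta} \left| \nabla_\theta \delta_{\theta,s}^{\chi,s-r}(y) - \nabla_\theta \delta_\theta(\vartheta^s \circ y) \right| \le \frac{16 \rho^{-1/4}}{1-\rho}\, \rho^{r/4}\, \xi(y) + \sup_{\theta \in \Theta} \left| \nabla_\theta \delta_{\theta,0}^{\chi,-r'}(\vartheta^s \circ y) - \nabla_\theta \delta_\theta(\vartheta^s \circ y) \right|.$$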



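Since $\xi(\vartheta^s \circ y) < +\infty$, when $r' \to +\infty$ the second term tends to zero by (47), for fixed $y$, $s$ and $\chi$. This concludes the proof. □

Lemma 4.12. (i) Assume [...]. For any $y \in \mathsf{Y}^{\mathbb{Z}}$ such that $\phi(y_u) < +\infty$ for any $u \in \mathbb{Z}$, and for any integers $r, s \ge 0$,

$$\sup_{\theta \in \Theta} \left| \nabla_\theta \delta_{\theta,s}^{\chi,s-r}(y) \right| \le 2 \sum_{u=s-r}^{s+1} \phi(y_u).$$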



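(ii) Assume [...]. Then, for any $y \in \mathsf{Y}^{\mathbb{Z}}$ such that $\xi(y) < +\infty$ and for any $r \ge 0$,

$$\sup_{\theta \in \Theta} \left| \nabla_\theta \delta_\theta(y) \right| \le 2 \sum_{u=-r}^{1} \phi(y_u) + \frac{16 \rho^{-1/4}}{1-\rho}\, \xi(y)\, \rho^{r/4},$$

where $\xi(y)$ is defined in Lemma 4.11.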



Proof. (i) By (31) and Lemma 4.10,



< 



|v^;r r (y) = v £* : :;[(y)-v^ ; r r (y) 

/ x(dx s - r )L0 jS - r:u -i(x s - r ,dx u )V0log[rn0(x u - 1 ,x u )g0(x a ,y u )]L0 iU:s - 1 (x u ,X) 



2 E 



u— s — r 



J x(dx s - r )L0 iS - r:s -i(x s - r ,X) 

The proof is concluded upon noting that for any $s - r \le u \le s + 1$,

J x(dxs-r)ge(x s - r , y s - r )L0 tS - r:u -i(x s - r , Ax u )V e log ge(x u ,y u )L g ^ u ., s ^ 1 (x u , X) 



J x(da; s -r).9e(a;s-r,ys-r)Le :S _ r : S _i(a; s _ r ,X) 

is upper bounded by $\phi(y_u)$.

(ii) is a consequence of Lemma 4.11(ii) and Lemma 4.12(i). □



Theorem 4.13. At 



i{§ 410® and ^ 
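(i) For any $T > 0$ and any distribution $\chi$ on $\mathsf{X}$, the functions $\theta \mapsto \ell_{\theta,T}^{\chi,0}(Y)$ and $\theta \mapsto c_*(\theta)$ are continuously differentiable $P_*$-a.s.

(ii) For any initial distribution $\chi$ on $(\mathsf{X}, \mathcal{B}(\mathsf{X}))$,

$$\frac{1}{T} \nabla_\theta \ell_{\theta,T}^{\chi,0}(Y) \underset{T \to +\infty}{\longrightarrow} \nabla_\theta c_*(\theta) \qquad P_*\text{-a.s.} \qquad (48)$$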



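Proof. By (33) and Lemma 4.10, for any $y$ such that $\phi(y_u) < +\infty$ for any $u \in \mathbb{Z}$, $\ell_{\theta,T}^{\chi,0}(y)$ and $\delta_{\theta,s}^{\chi,0}(y)$ are continuously differentiable and (33) implies

$$\nabla_\theta \ell_{\theta,T}^{\chi,0}(y) = \sum_{s=0}^{T-1} \nabla_\theta \delta_{\theta,s}^{\chi,0}(y).$$

This decomposition leads to

$$\frac{1}{T} \nabla_\theta \ell_{\theta,T}^{\chi,0}(Y) = \frac{1}{T} \sum_{s=0}^{T-1} \left( \nabla_\theta \delta_{\theta,s}^{\chi,0}(Y) - \nabla_\theta \delta_\theta(\vartheta^s \circ Y) \right) + \frac{1}{T} \sum_{s=0}^{T-1} \nabla_\theta \delta_\theta(\vartheta^s \circ Y). \qquad (49)$$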

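Consider the first term of the rhs of (49). Since $Y$ is a stationary process, the stated assumptions imply that $E_*[\xi(Y)] < +\infty$, where $\xi$ is defined by (45). Then, $\xi(Y) < +\infty$ $P_*$-a.s. and by Lemma 4.11, for any $0 \le s \le T - 1$,

$$\left| \nabla_\theta \delta_{\theta,s}^{\chi,0}(Y) - \nabla_\theta \delta_\theta(\vartheta^s \circ Y) \right| \le \xi(Y)\, \frac{16 \rho^{-1/4}}{1-\rho}\, \rho^{s/4}.$$

Therefore

$$\frac{1}{T} \sum_{s=0}^{T-1} \left| \nabla_\theta \delta_{\theta,s}^{\chi,0}(Y) - \nabla_\theta \delta_\theta(\vartheta^s \circ Y) \right| \le \frac{1}{T}\, \xi(Y)\, \frac{16 \rho^{-1/4}}{1-\rho}\, \frac{1}{1-\rho^{1/4}},$$

and

$$\lim_{T \to +\infty} \frac{1}{T} \sum_{s=0}^{T-1} \left( \nabla_\theta \delta_{\theta,s}^{\chi,0}(Y) - \nabla_\theta \delta_\theta(\vartheta^s \circ Y) \right) = 0 \qquad P_*\text{-a.s.}$$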



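Finally, consider the second term of the rhs of (49). By Lemma 4.12 (applied with $r = 1$), $E_*[|\nabla_\theta \delta_\theta(Y)|] < +\infty$. Under the stated assumptions, the ergodic theorem (see [1, Theorem 24.1, p. 314]) states that

$$\lim_{T \to +\infty} \frac{1}{T} \sum_{s=0}^{T-1} \nabla_\theta \delta_\theta(\vartheta^s \circ Y) = E_*[\nabla_\theta \delta_\theta(Y)] \qquad P_*\text{-a.s.}$$

Then, by (49) and the above discussion,

$$\lim_{T \to +\infty} \frac{1}{T} \nabla_\theta \ell_{\theta,T}^{\chi,0}(Y) = E_*[\nabla_\theta \delta_\theta(Y)] \qquad P_*\text{-a.s.}$$

By Lemma 4.12 applied with $r = 0$,

$$\sup_{\theta \in \Theta} |\nabla_\theta \delta_\theta(Y)| \le 2 [\phi(Y_0) + \phi(Y_1)] + \frac{16 \rho^{-1/4}}{1-\rho}\, \xi(Y),$$

and the rhs is integrable under the stated assumptions. Therefore, by the dominated convergence theorem, $E_*[\nabla_\theta \delta_\theta(Y)] = \nabla_\theta E_*[\delta_\theta(Y)] = \nabla_\theta c_*(\theta)$. This concludes the proof. □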
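The two elementary facts used in this proof (integrability of $\xi(Y)$ and the vanishing of the first term of (49)) can be spelled out as follows. This is a sketch under the assumption that $\xi(y) = \sum_{u \in \mathbb{Z}} \phi(y_u) \rho^{|u|/4}$ as in (45), that $\phi \ge 0$, that $0 < \rho < 1$, and that $E_*[\phi(Y_0)] < +\infty$.

```latex
% Tonelli's theorem (nonnegative terms) and stationarity of Y give
\[
E_*[\xi(Y)]
  = \sum_{u \in \mathbb{Z}} \rho^{|u|/4}\, E_*[\phi(Y_u)]
  = E_*[\phi(Y_0)] \sum_{u \in \mathbb{Z}} \rho^{|u|/4}
  = E_*[\phi(Y_0)]\, \frac{1+\rho^{1/4}}{1-\rho^{1/4}} \;<\; +\infty .
\]
% Geometric series bound for the Cesaro average of the forgetting terms
\[
\frac{1}{T} \sum_{s=0}^{T-1} \rho^{s/4}
  \;\le\; \frac{1}{T}\, \frac{1}{1-\rho^{1/4}}
  \;\xrightarrow[T \to +\infty]{}\; 0 .
\]
```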



4.5 Contrast and the limit set $\mathcal{L}$

Theorem 4.14. Assume [...]. Then, $\theta \in \mathcal{L}$ if and only if $\nabla_\theta c_*(\theta) = 0$, where $c_*(\theta) = \lim_{T \to +\infty} T^{-1} \ell_{\theta,T}^{\chi,0}(Y)$ $P_*$-a.s. for any initial distribution $\chi$ on $(\mathsf{X}, \mathcal{B}(\mathsf{X}))$.

Proof. For any initial distribution $\chi$, all $\theta \in \Theta$ and all $T > 0$,

Pe(xo-.T, Yi :T ) 



/ Pe( z o-.T, Yi :T )A(dz 0: T 



-A(dx :T) 



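where $p_\theta$ is defined in Eq. (21). Under Assumption H1(a),

$$\frac{1}{T} \nabla_\theta \log p_\theta(x_{0:T}, Y_{1:T}) = \nabla_\theta \phi(\theta) + \nabla_\theta \psi^T(\theta) \left\{ \frac{1}{T} \sum_{t=1}^{T} S(x_{t-1}, x_t, Y_t) \right\},$$

and then, by definition of $\Phi_{\theta,t,T}^{\chi,0}(S, Y)$ (see (20)),

$$\frac{1}{T} \nabla_\theta \ell_{\theta,T}^{\chi,0}(Y) = \nabla_\theta \phi(\theta) + \nabla_\theta \psi^T(\theta) \left\{ \frac{1}{T} \sum_{t=1}^{T} \Phi_{\theta,t,T}^{\chi,0}(S, Y) \right\}. \qquad (50)$$

Under the stated assumptions, [8, Theorem 4.1] and Theorem 4.13 can be applied. Therefore, (50) becomes, as $T \to +\infty$,

$$\nabla_\theta c_*(\theta) = \nabla_\theta \phi(\theta) + \nabla_\theta \psi^T(\theta) \left\{ E_*\left[ E_\theta[S(X_{-1}, X_0, Y_0) | Y] \right] \right\}.$$

The proof follows upon noting that, by definition of $\bar\theta$, the unique solution to the equation $\nabla_\theta \phi(\tau) + \nabla_\theta \psi^T(\tau) \left\{ E_*\left[ E_\theta[S(X_{-1}, X_0, Y_0) | Y] \right] \right\} = 0$ is $\tau = R(\theta)$. □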
5 Additional experiments

In this section, we provide additional plots for the applications studied in [8, Section 3].

5.1 Linear Gaussian model 

Figure 1 illustrates the fact that the convergence properties of the BOEM do not depend on the initial distribution $\chi$ used in each block. Data are sampled using $\phi = 0.97$, $\sigma_u^2 = 0.6$ and $\sigma_v^2 = 1$. All runs are started with $\phi = 0.1$, $\sigma_u^2 = 1$ and $\sigma_v^2 = 2$. Figure 1 displays the estimation of $\phi$ by the averaged BOEM algorithm for two increasing block-size sequences $\{\tau_n\}_{n \ge 1}$, over 100 independent Monte Carlo runs, as a function of the number of blocks. We consider first the case when $\chi$ is the stationary distribution of the hidden process, i.e. $\chi \equiv \mathcal{N}(0, (1 - \phi^2)^{-1} \sigma_u^2)$, and then the case when $\chi$ is the filtering distribution obtained at the end of the previous block, computed with the Kalman filter. The estimation error is similar for both initialization schemes, even when $\phi$ is close to 1 and for any choice of $\{\tau_n\}_{n \ge 1}$.
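The role of the initial distribution can be illustrated with a small, self-contained sketch. It is not the code used for Figure 1: it assumes the scalar linear Gaussian model $X_{t+1} = \phi X_t + \sigma_u U_t$, $Y_t = X_t + \sigma_v V_t$, and the function names `simulate_lgm` and `kalman_filter`, the block length of 50 and the second starting law are hypothetical choices made for illustration. Two Kalman filters run on the same block from different Gaussian initial laws end up with essentially the same filtering distribution.

```python
import math
import random


def simulate_lgm(n, phi, sigma_u2, sigma_v2, rng):
    """Draw n observations from X_{t+1} = phi*X_t + sigma_u*U_t, Y_t = X_t + sigma_v*V_t."""
    x = rng.gauss(0.0, math.sqrt(sigma_u2 / (1.0 - phi ** 2)))  # stationary start
    ys = []
    for _ in range(n):
        x = phi * x + math.sqrt(sigma_u2) * rng.gauss(0.0, 1.0)
        ys.append(x + math.sqrt(sigma_v2) * rng.gauss(0.0, 1.0))
    return ys


def kalman_filter(ys, phi, sigma_u2, sigma_v2, mean0, var0):
    """Scalar Kalman filter started from N(mean0, var0).

    Returns the mean and variance of the filtering distribution of the
    hidden state given all observations in ys."""
    mean, var = mean0, var0
    for y in ys:
        mean_pred = phi * mean                      # prediction step
        var_pred = phi ** 2 * var + sigma_u2
        gain = var_pred / (var_pred + sigma_v2)     # Kalman gain
        mean = mean_pred + gain * (y - mean_pred)   # correction step
        var = (1.0 - gain) * var_pred
    return mean, var


phi, sigma_u2, sigma_v2 = 0.97, 0.6, 1.0
block = simulate_lgm(50, phi, sigma_u2, sigma_v2, random.Random(0))

# Scheme 1: start the block from the stationary law of the hidden chain.
stationary = kalman_filter(block, phi, sigma_u2, sigma_v2,
                           0.0, sigma_u2 / (1.0 - phi ** 2))
# Scheme 2: start from another Gaussian law (stand-in for a previous filter).
other = kalman_filter(block, phi, sigma_u2, sigma_v2, 1.5, 0.3)
```

The filter forgets its initialization geometrically fast, which is one informal way to read the similarity of the two panels of Figure 1.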

The theoretical analysis of BOEM says that a sufficient condition for convergence is the increasing size of the blocks. In Figure 2, we compare different strategies for the definition of $\tau_n := T_n - T_{n-1}$. A slowly increasing sequence $\{\tau_n\}_{n \ge 0}$ is compared to different strategies using the same number of observations within each block. We consider the Linear Gaussian model:

$$X_{t+1} = \phi X_t + \sigma_u U_t, \qquad Y_t = X_t + \sigma_v V_t,$$

where $X_0 \sim \mathcal{N}(0, \sigma_u^2 (1 - \phi^2)^{-1})$, $\{U_t\}_{t \ge 0}$, $\{V_t\}_{t \ge 0}$ are i.i.d. standard Gaussian r.v., independent from $X_0$. Data are sampled using $\phi = 0.9$, $\sigma_u^2 = 0.6$ and $\sigma_v^2 = 1$. All runs are started with $\phi = 0.1$, $\sigma_u^2 = 1$ and $\sigma_v^2 = 2$. Figure 2 shows the estimation of $\phi$ over 100 independent Monte Carlo runs (the same conclusions could be drawn for $\sigma_u^2$ and $\sigma_v^2$). For each choice of $\{\tau_n\}_{n \ge 0}$, the median and first and last quartiles of the estimation are represented as a function of the number of observations.

We observe that BOEM does not converge when the block size sequence is constant and small: as shown in Figure 2, if the number of observations per block is too small ($\tau_n = 25$), the algorithm is a poor approximation of the limiting EM recursion and does not converge. With greater block sizes ($\tau_n = 100$ or $\tau_n = 350$), the algorithm converges but the convergence is slower because it is initialized far from the true value and many observations are needed to get several estimations. BOEM with slowly increasing block sizes has a better behavior since many estimations are produced at the beginning and, once the estimates are closer to the true value, the bigger block sizes reduce the variance of the estimation.
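The trade-off between block size and number of parameter updates can be made concrete by counting the blocks that fit in a fixed budget of observations. The sketch below is illustrative only: the helper `block_ends`, the rounding of $n^{1.1}$ down to an integer and the horizon of 5000 observations are assumptions, not settings taken from the experiments.

```python
def block_ends(n_obs, block_size):
    """Return the block end times T_1 < T_2 < ... <= n_obs.

    block_size(n) gives tau_n (rounded down, at least 1), and
    T_n = T_{n-1} + tau_n with T_0 = 0."""
    ends, total, n = [], 0, 1
    while True:
        tau = max(1, int(block_size(n)))
        if total + tau > n_obs:
            break
        total += tau
        ends.append(total)
        n += 1
    return ends


schemes = {
    "tau_n = n^1.1": lambda n: n ** 1.1,
    "tau_n = 25": lambda n: 25,
    "tau_n = 100": lambda n: 100,
    "tau_n = 350": lambda n: 350,
}

# Number of parameter updates each scheme produces within 5000 observations.
updates = {name: len(block_ends(5000, size)) for name, size in schemes.items()}
for name, count in updates.items():
    print(f"{name}: {count} updates")
```

With the slowly increasing scheme, most updates happen early (short blocks), while late blocks are long; constant schemes spread their updates evenly.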

Moreover, our convergence rates are given up to a multiplicative constant : 
the theory says that J2 n T ™ 1 ^ < 00 where 7 is related to the ergodic behavior 
of the HMM (see assumptions r^6| . 

Even if the sequence is chosen to increase at a polynomial rate, we can have $\tau_n \sim c n^a$ ($a > 1$) with a constant $c$ such that the first blocks are quite small, to allow a sufficiently large number of updates of the parameters $\{\theta_n, n \ge 1\}$. During a (deterministic) "burn-in" period, the first blocks can even be of a fixed length before beginning the "increasing" procedure.

(b) $\tau_n \sim n^{1.6}$

Figure 1: Estimation of $\phi$ after 5, 10, 25, 50 and 150 blocks, with two different initialization schemes: the stationary distribution (left) and the filtering distribution at the end of the previous block (right). The boxplots are computed with 100 Monte Carlo runs.

5.2 Finite state-space HMM 

Observations are sampled using $d = 6$, $v = 0.5$, $x_i = i$, $\forall i \in \{1, \dots, d\}$ and the true transition matrix is given by

[Figure 2: three panels; horizontal axis: number of observations.]
(a) $\tau_n = n^{1.1}$ (red) and $\tau_n = 25$ (blue). (b) $\tau_n = n^{1.1}$ (red) and $\tau_n = 100$ (blue). (c) $\tau_n = n^{1.1}$ (red) and $\tau_n = 350$ (blue).

Figure 2: Estimation of $\phi$ with different block size schemes: the median (bold line) and the first and last quartiles (dotted line) are shown for $\tau_n = n^{1.1}$ (red), $\tau_n = 100$ (black) and $\tau_n = 350$ (purple). The quantities are computed with 100 Monte Carlo runs.

5.2.1 Comparison to an online EM based procedure 

In this case, we want to estimate the states $\{x_1, \dots, x_d\}$. All the runs are started from $v = 2$ and from the initial states $\{-1; 0; 0.5; 2; 3; 4\}$. The experiment is the same as the one in [5, Section 3.2.1]. The averaged BOEM is compared to an online EM procedure (see [5]) combined with Polyak-Ruppert averaging (see [10]). This online EM based algorithm follows a stochastic approximation update and depends on a step-size sequence $\{\gamma_n\}_{n \ge 0}$ which is chosen in the same way as in [8, Section 3.2.1]. Figure 3 displays the empirical median and first and last quartiles for the estimation of $x_2$ with both averaged algorithms as a function of the number of observations. These estimates are obtained over 100 independent Monte Carlo runs with $\tau_n = n^{1.1}$ and $\gamma_n = n^{-0.53}$. Both algorithms converge to the true value $x_2 = 2$ and these plots confirm the similar behavior of BOEM and the online EM of [2].
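Polyak-Ruppert averaging itself is easy to illustrate on a toy problem. The sketch below is not the online EM algorithm of [2]: it applies a generic stochastic approximation recursion with step size $\gamma_n = n^{-0.53}$ to the estimation of a scalar from noisy observations, and keeps a running average of the iterates. The function name `averaged_sa`, the Gaussian noise and the target value are hypothetical.

```python
import random


def averaged_sa(n_iter, target, rng, exponent=0.53, theta0=0.0):
    """Toy stochastic approximation with Polyak-Ruppert averaging.

    theta_n = theta_{n-1} + gamma_n * (noisy observation - theta_{n-1}),
    with gamma_n = n**(-exponent); the averaged estimate is the running
    mean of theta_1, ..., theta_n."""
    theta, average = theta0, 0.0
    for n in range(1, n_iter + 1):
        gamma = n ** (-exponent)
        noisy = target + rng.gauss(0.0, 1.0)   # noisy observation of the target
        theta += gamma * (noisy - theta)       # stochastic approximation update
        average += (theta - average) / n       # running Polyak-Ruppert average
    return theta, average


theta_last, theta_avg = averaged_sa(20000, 2.0, random.Random(1))
print(f"last iterate: {theta_last:.3f}, averaged iterate: {theta_avg:.3f}")
```

The averaged iterate is typically less noisy than the last iterate, which is the motivation for comparing averaged versions of both algorithms.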

5.2.2 Comparison to a recursive maximum likelihood procedure 

(a) Estimation of $x_2$ with averaged BOEM. (b) Estimation of $x_2$ with averaged OEM.

Figure 3: Estimation of $x_2$ using the averaged online EM and averaged BOEM. Each plot displays the empirical median (bold line) and the first and last quartiles (dotted lines) over 100 independent Monte Carlo runs with $\tau_n = n^{1.1}$ and $\gamma_n = n^{-0.53}$. The first ten observations are omitted for better visibility.

In the numerical applications below, we give supplementary graphs to compare the convergence of the averaged BOEM with the convergence of the Polyak-Ruppert averaged RML procedure. The experiment is the same as the one in [8, Section 3.2.2]. Figures 4 and 5 display the empirical median and first and last quartiles of the estimation of $v$ and $m(1,2)$ over 100 independent Monte Carlo runs. Both algorithms have a similar behavior for the estimation of these parameters.
(a) Averaged BOEM. (b) Averaged RML.

Figure 4: Empirical median (bold line) and first and last quartiles (dotted line) for the estimation of $v$ using the averaged RML algorithm (right) and the averaged BOEM algorithm (left). The true value is $v = 0.5$ and the averaging procedure is started after 10000 observations. The first 10000 observations are not displayed for clarity.

5.3 Stochastic volatility model 

Consider the following stochastic volatility model:

$$X_{t+1} = \phi X_t + \sigma U_t, \qquad Y_t = \beta e^{X_t/2} V_t,$$

where $X_0 \sim \mathcal{N}(0, (1 - \phi^2)^{-1} \sigma^2)$ and $(U_t)_{t \ge 0}$ and $(V_t)_{t \ge 0}$ are two sequences of i.i.d. standard Gaussian r.v., independent from $X_0$. Data are sampled using $\phi = 0.8$, $\sigma^2 = 0.2$ and $\beta^2 = 1$. All runs are started with $\phi = 0.1$, $\sigma^2 = 0.6$ and $\beta^2 = 2$.

(a) Averaged BOEM. (b) Averaged RML.

Figure 5: Empirical median (bold line) and first and last quartiles (dotted line) for the estimation of $m(1,2)$ using the averaged RML algorithm (right) and the averaged BOEM algorithm (left). The true value is $m(1,2) = 0.05$ and the averaging procedure is started after 10000 observations. The first 10000 observations are not displayed for clarity.
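A minimal simulation of this model can be sketched as follows. The observation equation is assumed here to have the standard form $Y_t = \beta e^{X_t/2} V_t$; the function name `simulate_sv`, the sample size and the seed are hypothetical choices. The empirical second moment of the hidden chain is compared with the stationary variance $\sigma^2 / (1 - \phi^2)$ as a sanity check.

```python
import math
import random


def simulate_sv(n, phi, sigma2, beta2, rng):
    """Simulate X_{t+1} = phi*X_t + sigma*U_t and Y_t = beta*exp(X_t/2)*V_t.

    X_0 is drawn from the stationary law N(0, sigma2 / (1 - phi^2))."""
    sigma, beta = math.sqrt(sigma2), math.sqrt(beta2)
    x = rng.gauss(0.0, math.sqrt(sigma2 / (1.0 - phi ** 2)))
    xs, ys = [], []
    for _ in range(n):
        xs.append(x)
        ys.append(beta * math.exp(x / 2.0) * rng.gauss(0.0, 1.0))
        x = phi * x + sigma * rng.gauss(0.0, 1.0)
    return xs, ys


# Parameter values used to sample the data in this section.
xs, ys = simulate_sv(20000, 0.8, 0.2, 1.0, random.Random(0))

empirical_var = sum(x * x for x in xs) / len(xs)   # the chain is centred
stationary_var = 0.2 / (1.0 - 0.8 ** 2)
print(f"empirical: {empirical_var:.3f}, stationary: {stationary_var:.3f}")
```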

In this model, the smoothed sufficient statistics cannot be computed explicitly. We thus propose to replace the exact computation by a Monte Carlo approximation based on particle filtering. The performance of the Stochastic BOEM is compared to the online EM algorithm given in [2] (see also [5]). To the best of our knowledge, there are no results on the asymptotic behavior of the algorithms of [2, 5]; these algorithms rely on many approximations that make the proof quite difficult. Although there are no results in the literature on the rate of convergence of the online EM algorithm of [2], we choose the step size $\gamma_n$ in [2] and the block size $\tau_n$ such that $\gamma_n = n^{-0.6}$ and $\tau_n \propto n^{3/2}$ (see [5, Section 3.2.1] for a discussion on this choice). 50 particles are used for the approximation of the filtering distribution by particle filtering. We report in Figure 6 the boxplots for the estimation of the three parameters $(\beta, \phi, \sigma^2)$ for the Polyak-Ruppert [10] averaged online EM and the averaged BOEM. Both averaged versions are started after 20000 observations. Figure 6 displays the estimation of $\phi$, $\sigma^2$ and $\beta^2$. This figure shows that both algorithms have the same behavior. Similar conclusions are obtained by considering other true values for $\phi$ (such as $\phi = 0.95$). Therefore, the intuition is that online EM and Stochastic BOEM have the same asymptotic behavior. The main advantage of the second approach is that it relies on approximations which can be controlled in such a way that we are able to show that the limiting points of the particle version of the Stochastic BOEM algorithm are the stationary points of the limiting normalized log-likelihood of the observations.
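To make the particle approximation concrete, here is a minimal bootstrap particle filter with 50 particles. It is a generic sketch, not the particle smoother used by the Stochastic BOEM: it only approximates filtering means, uses multinomial resampling at every step, and assumes the observation equation $Y_t = \beta e^{X_t/2} V_t$. The names `simulate_sv` and `bootstrap_filter` and the data length are hypothetical.

```python
import math
import random


def simulate_sv(n, phi, sigma2, beta2, rng):
    """Simulate observations from the stochastic volatility model."""
    sigma, beta = math.sqrt(sigma2), math.sqrt(beta2)
    x = rng.gauss(0.0, math.sqrt(sigma2 / (1.0 - phi ** 2)))
    ys = []
    for _ in range(n):
        ys.append(beta * math.exp(x / 2.0) * rng.gauss(0.0, 1.0))
        x = phi * x + sigma * rng.gauss(0.0, 1.0)
    return ys


def bootstrap_filter(ys, phi, sigma2, beta2, n_particles, rng):
    """Return particle estimates of the filtering means E[X_t | Y_0..Y_t]."""
    sigma = math.sqrt(sigma2)
    stat_sd = math.sqrt(sigma2 / (1.0 - phi ** 2))
    particles = [rng.gauss(0.0, stat_sd) for _ in range(n_particles)]
    means = []
    for y in ys:
        # Log-density of y given x: Y | X = x is N(0, beta2 * exp(x)).
        logw = [-0.5 * (math.log(2.0 * math.pi * beta2) + x
                        + y * y / (beta2 * math.exp(x)))
                for x in particles]
        top = max(logw)
        weights = [math.exp(lw - top) for lw in logw]   # stabilised weights
        total = sum(weights)
        means.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # Multinomial resampling, then propagation through the AR(1) dynamics.
        particles = rng.choices(particles, weights=weights, k=n_particles)
        particles = [phi * x + sigma * rng.gauss(0.0, 1.0) for x in particles]
    return means


rng = random.Random(0)
ys = simulate_sv(500, 0.8, 0.2, 1.0, rng)
filter_means = bootstrap_filter(ys, 0.8, 0.2, 1.0, 50, rng)
```

With so few particles the estimates are noisy; the point of the sketch is only to show the weight, resample and propagate steps that a particle approximation relies on.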
(a) Estimation of $\phi$. (b) Estimation of $\sigma^2$. (c) Estimation of $\beta^2$.

Figure 6: Estimation of $\phi$, $\sigma^2$ and $\beta^2$ using the averaged online EM algorithm (left) and the averaged BOEM (right), after $n = \{1000, 10k, 50k, 100k\}$ observations. The true value of $\phi$ is 0.8.



We now compare the two algorithms when the true value of $\phi$ is (in absolute value) closer to 1: we choose $\phi = 0.95$, $\beta^2$ and $\sigma^2$ being the same as in the previous experiment. As illustrated in Figure 7, the same conclusions are drawn for greater values of $\phi$.

Figure 7: Estimation of $\phi$ using the averaged online EM algorithm (left) and the averaged BOEM algorithm (right), after $n = \{5k, 25k, 40k, 50k\}$ observations. The true value of $\phi$ is 0.95.